DSPRelated.com
Forums

Lossy &/or low data rate speech compression question [VERY L-O--N----G preamble ;]

Started by Richard Owlett March 10, 2004
[ cross post comp.dsp , comp.speech.research , comp.speech.users ]
[ follow up set to comp.dsp ]

Background:
  I'm interested in speech recognition.
  I have *NO* formal background in speech recognition.
  I do not have speech recognition on my system (Win XP Pro )
    There seems to be only one viable choice in Windows
      Dragon Naturally Speaking -- useful version costs too much
    Free ( as in either beer or speech ) presumes Linux and a C
      compiler. I have neither. Learning curves not worthwhile.
  I lurk on speech related newsgroups.
  I'm old enough to have listened to some early attempts at TTS
    ( late 60's and early 70's )

  Some time back I came across a 'museum type' web page chronicling 
various mechanisms ( largely mechanical ) for simulating/generating 
speech, many dating from roughly the 1890s to the 1910s. I've lost the URL :{


Observation:

Humans can recognize distorted speech under many negative environments.
Under distortion I include accents and poor computer generated speech.
Under "negative environments" I include background noise and multiple 
speech streams.
Human speech recognition is also *obviously* speaker independent.
Current computer speech recognition seems to be *NONE* of above.

Premise:

Current speech recognition is:
   too heavily oriented towards emulating the vocal tract exactly.
       SECONDARY evidence:
       When users describe problems on comp.speech.users ,
       the two common responses are:
          1. How quiet/ideal is your environment?
          2. What's the fidelity of your sound acquisition?
   is heavily influenced by decades-old computing limitations.

The QUESTION
How do humans recognize speech?

*MY* question to group

Can someone point me to a library of DLLs for compression/decompression?

I am particularly interested in maximum compression algorithms which 
yield minimally human comprehensible speech.
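[ For scale, a back-of-the-envelope sketch of the compression ratios involved. The 8 kHz / 16-bit baseline and the classic 2400 bps LPC-10 vocoder rate are standard textbook figures; the other rates are just illustrative assumptions: ]

```python
# Rough compression-ratio arithmetic (assumed figures: 8 kHz / 16-bit
# telephone-band PCM as the baseline; 2400 bps is the classic LPC-10
# vocoder rate; 1800 bps and below is where speech starts to degrade).
pcm_bps = 8000 * 16          # 128000 bps uncompressed baseline

for codec_bps in (13000, 2400, 1800, 800):
    ratio = pcm_bps / codec_bps
    print(f"{codec_bps:>6} bps -> {ratio:5.1f}x smaller than PCM")
```

The interesting region for my question is the bottom of that table, where the codec throws away so much that human comprehension itself starts to fail.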

I am also interested in early TTS.


I wish to determine what is actually important to *HUMAN* speech 
recognition.



[My email address is valid  *but*  with heavy spam filtering ;]
"Richard Owlett" <rowlett@atlascomm.net> wrote in message
news:104v3q960dp0b38@corp.supernews.com...
> [ cross post comp.dsp , comp.speech.research , comp.speech.users ]
> [ follow up set to comp.dsp ]
>
> Observation:
>
> Human speech recognition is also *obviously* speaker independent.
While this is true to some extent, humans certainly can learn to understand certain speakers better, just as the learning algorithms do. For example, after attending school with quite a few Asians, I find I can understand English spoken with a thick Asian accent better than some of my contemporaries can. Another good example is toddler speech: usually a parent or sibling can understand what a very young speaker is saying while an outsider (e.g. a babysitter) has no clue. Back when I babysat some kids, I used to ask the older sibling to translate for me what the younger was saying. It worked quite well.

I also know next to nothing about speech recognition, but thought I'd throw out these observations, FWIW.
Richard Owlett wrote:

> [ cross post comp.dsp , comp.speech.research , comp.speech.users ]
> [ follow up set to comp.dsp ]
>
> The QUESTION
> How do humans recognize speech?
>
> *MY* question to group
>
> Can someone point me to a library of DLLs for compression/decompression?
You may want to check out my free HawkVoiceDI library. You can listen to sound samples encoded/decoded with some low bit rate codecs at http://www.hawksoft.com/hawkvoice/codecs.shtml
> I am particularly interested in maximum compression algorithms which
> yield minimally human comprehensible speech.
The OpenLPC codec 'slurs' the speech as you lower the bit rate below 1800 bps.
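For anyone curious what sits underneath codecs like OpenLPC, here is a minimal textbook sketch of the LPC analysis step ( autocorrelation plus Levinson-Durbin recursion; this is NOT the library's actual code, just the standard method such vocoders build on ):

```python
# Minimal sketch of the analysis step behind LPC vocoders: a frame of
# speech is reduced to a handful of all-pole filter coefficients (plus
# pitch and gain, not shown), which is where the huge bit-rate savings
# -- and the "slurring" at very low rates -- come from.
import math

def lpc_coefficients(frame, order):
    """Return `order` LPC coefficients for one frame via Levinson-Durbin."""
    n = len(frame)
    # Autocorrelation of the frame, lags 0..order
    r = [sum(frame[i] * frame[i + k] for i in range(n - k))
         for k in range(order + 1)]
    a = [0.0] * (order + 1)
    a[0] = 1.0
    err = r[0]
    for m in range(1, order + 1):
        # Reflection coefficient for this recursion step
        k = -sum(a[j] * r[m - j] for j in range(m)) / err
        new_a = a[:]
        for j in range(1, m):
            new_a[j] = a[j] + k * a[m - j]
        new_a[m] = k
        a = new_a
        err *= (1.0 - k * k)
    return a[1:]

# A synthetic 8 kHz "vowel-like" frame: decaying sinusoid near 300 Hz
frame = [math.sin(2 * math.pi * 300 * t / 8000) * math.exp(-t / 200)
         for t in range(240)]
coeffs = lpc_coefficients(frame, order=10)
print(len(coeffs), "coefficients stand in for a 240-sample frame")
```

Ten numbers per 30 ms frame instead of 240 samples is the whole trick; cut the order or quantise the coefficients harder and intelligibility falls off, which is presumably what the sub-1800 bps slurring sounds like.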
> I am also interested in early TTS.
--
Phil Frisbie, Jr.
Hawk Software
http://www.hawksoft.com
Hello,
I have ported TTS & SR engines from MS VC6.0 to the TMS..54x
& iPAQ Pocket PC platforms.
 For the source engines, see 'www.sakrament.com'
 I'll be glad to discuss my problems and yours.

In article <104v3q960dp0b38@corp.supernews.com>, Richard Owlett
<rowlett@atlascomm.net> writes
> I do not have speech recognition on my system (Win XP Pro )
>   There seems to be only one viable choice in Windows
>     Dragon Naturally Speaking -- useful version costs too much
>   Free ( as in either beer or speech ) presumes Linux and a C
>     compiler. I have neither. Learning curves not worthwhile.
Shocking laziness! Usually you only need to type configure and make.
> Observation:
> Humans can recognize distorted speech under many negative environments.
> Under distortion I include accents and poor computer generated speech.
> Under "negative environments" I include background noise and multiple
> speech streams.
> Human speech recognition is also *obviously* speaker independent.
> Current computer speech recognition seems to be *NONE* of above.
It is, but it varies by degrees. As a counter-example, speaker identification is performed better by machines than by people, as we're optimised to pick up changes in speech. And as for understanding accents, it takes humans time to tune into a different accent - you don't just understand it immediately.
> Premise:
> Current speech recognition is:
>    too heavily oriented towards emulating the vocal tract exactly.
>        SECONDARY evidence:
>        When users describe problems on comp.speech.users ,
>        the two common responses are:
>           1. How quiet/ideal is your environment?
>           2. What's the fidelity of your sound acquisition?
>    is heavily influenced by decades-old computing limitations.
> The QUESTION
> How do humans recognize speech?
Who knows? You can follow the trail from the auditory end into the brain, but what follows after that is anyone's guess. Perceptually, people tend to have phenomenal pattern-matching ability, in both the visual and auditory domains. Computers don't.

Generally people do not perform speech recognition; they understand ideas and intent via language. We're not really interested in the words but in the ideas behind them. That's not really how machines work.

One other point is that, thanks to our faculties, we can predict what is going to be said from syntax, semantics and domain knowledge to a far greater extent than a language model can.
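As a toy sketch of the language-model side of that point ( my own illustration, not taken from any real recogniser ): a bigram table predicts the next word from the previous one, a crude stand-in for the syntax and domain knowledge humans bring to bear:

```python
# Toy bigram "language model": count word pairs in a tiny corpus, then
# predict the most likely next word given the previous one.
from collections import Counter, defaultdict

corpus = ("the cat sat on the mat the cat ate the fish "
          "the dog sat on the rug").split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict(prev_word):
    """Most frequent word seen after `prev_word` in the corpus."""
    return bigrams[prev_word].most_common(1)[0][0]

print(predict("the"))   # 'cat' follows 'the' most often in this corpus
print(predict("sat"))   # 'on' is the only word ever seen after 'sat'
```

A real recogniser's language model is the same idea scaled up; the human listener's "model" additionally knows what the conversation is about, which no bigram table does.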
> I wish to determine what is actually important to *HUMAN* speech
> recognition.
Good luck - at the end of the day the ear and the microphone get to see the same wibblies.

John Openshaw
Phil Frisbie, Jr. wrote:
> Richard Owlett wrote:
>
>> [ cross post comp.dsp , comp.speech.research , comp.speech.users ]
>> [ follow up set to comp.dsp ]
>>
>> The QUESTION
>> How do humans recognize speech?
>>
>> *MY* question to group
>>
>> Can someone point me to a library of DLLs for compression/decompression?
>
> You may want to check out my free HawkVoiceDI library. You can listen to
> sound samples encoded/decoded with some low bit rate codecs at
> http://www.hawksoft.com/hawkvoice/codecs.shtml
That seems to be what I'm looking for. Following some of your off site links indicates some reading I must do ;}
>> I am particularly interested in maximum compression algorithms which
>> yield minimally human comprehensible speech.
>
> The OpenLPC codec 'slurs' the speech as you lower the bit rate below
> 1800 bps.
>
>> I am also interested in early TTS.
Vic wrote:

> Hello,
> I have ported TTS & SR engines from MS VC6.0 to the TMS..54x
> & iPAQ Pocket PC platforms.
>  For the source engines, see 'www.sakrament.com'
>  I'll be glad to discuss my problems and yours.
Thanks for the link. A reply on comp.dsp pointed me to http://www.hawksoft.com/hawkvoice/codecs.shtml . It seems to be just what I was looking for. It also has some informative links.
In article <104v3q960dp0b38@corp.supernews.com>,
Richard Owlett  <rowlett@atlascomm.net> wrote:
> The QUESTION
> How do humans recognize speech?
Do they? I seem to recall reading about some experiments where some percentage of words were randomly altered in some human-composed text. When the text was read aloud, what listeners reported hearing was often not what was said. Anyone have a reference to these experiments?

Should a speech-to-text system report what was actually said, or what the speaker meant to say?

(posting from comp.dsp) IMHO. YMMV.

--
Ron Nicholson    rhn AT nicholson DOT com    http://www.nicholson.com/rhn/
#include <canonical.disclaimer>      // only my own opinions, etc.
John Openshaw wrote:
> In article <104v3q960dp0b38@corp.supernews.com>, Richard Owlett
> <rowlett@atlascomm.net> writes
>
>> I do not have speech recognition on my system (Win XP Pro )
>>   There seems to be only one viable choice in Windows
>>     Dragon Naturally Speaking -- useful version costs too much
>>   Free ( as in either beer or speech ) presumes Linux and a C
>>     compiler. I have neither. Learning curves not worthwhile.
>
> Shocking laziness! Usually you only need to type configure and make.
I prefer to call it careful allocation of scarce resources ;} Seriously, most of the "free" stuff seems to be developed under *nix. For OT reasons I operate solely in a Windoze environment, which means I'd have to operate under a Cygwin layer. It just gets too messy.
>> Observation:
>> Humans can recognize distorted speech under many negative environments.
>> Under distortion I include accents and poor computer generated speech.
>> Under "negative environments" I include background noise and multiple
>> speech streams.
>> Human speech recognition is also *obviously* speaker independent.
>> Current computer speech recognition seems to be *NONE* of above.
>
> It is, but it varies by degrees.
I'm not sure. But then again I'm only going on second hand data, requests for assistance on comp.speech.users . Summary of advice given seems to be "user must adapt to software."
> As a counter-example, speaker identification is performed better by
> machines than by people, as we're optimised to pick up changes in
> speech.
Is that relevant? I had the general impression that speaker identification picked out features which were characteristic of the physical structure/dimensions of the vocal tract.
> And as for understanding accents, it takes humans time to tune into a
> different accent - you don't just understand it immediately.
Which implies to me that there is a "common feature" to be heard. But we are usually careless/untrained listeners.

I vividly remember a demonstration in an undergrad introductory linguistics course I took 40+ years ago. The class was primarily Yanks plus two fellows from somewhere deep in Dixie. The prof had the two southerners say "pin" and "pen". We Yanks could not tell the difference. They and the prof had no problem. I was also used as an example -- seems I have a vowel sound that locates my home to within ~20 miles.
>> Premise:
>> Current speech recognition is:
>>    too heavily oriented towards emulating the vocal tract exactly.
>>        SECONDARY evidence:
>>        When users describe problems on comp.speech.users ,
>>        the two common responses are:
>>           1. How quiet/ideal is your environment?
>>           2. What's the fidelity of your sound acquisition?
>>    is heavily influenced by decades-old computing limitations.
>
>> The QUESTION
>> How do humans recognize speech?
>
> Who knows? You can follow the trail from the auditory end into the
> brain, but what follows after that is anyone's guess. Perceptually,
> people tend to have phenomenal pattern-matching ability, in both the
> visual and auditory domains. Computers don't.
Agreed
> Generally people do not perform speech recognition; they understand
> ideas and intent via language. We're not really interested in the words
> but in the ideas behind them. That's not really how machines work.
Ahhh. This may be the crux of the issue. We seem to have different definitions of "speech recognition". Perhaps a better term for what I'm interested in would be "phoneme/allophone recognition/discrimination". I'm also biased toward "discrete" rather than "continuous" recognition. I've gotten the impression that programs such as DNS are so strongly oriented toward "continuous speech" that their accuracy suffers if given one word at a time.
> One other point is that, thanks to our faculties, we can predict what
> is going to be said from syntax, semantics and domain knowledge to a
> far greater extent than a language model can.
We seem to be speaking of different aspects of the problem.
>> I wish to determine what is actually important to *HUMAN* speech
>> recognition.
>
> Good luck - at the end of the day the ear and the microphone get to see
> the same wibblies.
One of the advantages of being an "amateur" [ as in its derivation ;]. I don't have to produce and I can follow any rabbit trail that becomes interesting.
> John Openshaw
Doing a Google search to see what you had previously written in comp.dsp and comp.speech.research led to some interesting side paths. [ I initially searched to see how I should take being described as 'lazy' ,) Found that you like people to do their homework. BTW, you once told me the WEB wasn't the only source for information. Agreed, but try getting technical books from a rural library system. Major issues for a 150 mile radius have been effluent ( solid, liquid AND gaseous ) of cattle, hogs, and chickens ;/ ]

One thread had points similar to your comments on accents. One writer states that many find older low-fidelity computer generated speech to be more intelligible than some more modern "natural sounding" voices. This, and what I saw at a 'museum type' web page chronicling various mechanisms ( largely mechanical ) for simulating/generating speech circa 1890/1910, reinforces my "gut feel" that the current emphasis on high input sound fidelity indicates something is askew somewhere.

Another thread touched on the problem of a noisy environment. A statement was made that if certain models were trained in a very quiet environment, their accuracy went way down if even a slight amount of noise was present in the recognition environment. This seems to be the opposite of how whatever method humans use operates.

Am I far out in left field, or just a run-of-the-mill neophyte?
"Richard Owlett" <rowlett@atlascomm.net> schrieb
>>> Free ( as in either beer or speech ) presumes Linux and a C
>>> compiler. I have neither. Learning curves not worthwhile.
>>
>> Shocking laziness! Usually you only need to type configure and
>> make.
>
> I prefer to call it careful allocation of scarce resources ;}
> Seriously, most of the "free" stuff seems to be developed under
> *nix. For OT reasons I operate solely in a Windoze environment,
> which means I'd have to operate under a Cygwin layer. It just
> gets too messy.
If there is really that much "free" stuff under *nix around, you might want to consider Linux. It's easy. Get a CD-ROM distribution (such as e.g. Knoppix, at www.knoppix.org) to try it out. Often they're also available at newsstands. There's no installation on the hard disk involved if you don't want it, so you don't mess up your XP. A C compiler is often included, and as John pointed out, creating a program is as easy as:

1. download the sources
2. unpack
3. build:

   $ cd /to/where/the/source/is
   $ ./configure
   $ make

and your program is (normally) ready to run. Do try it out, you'll be surprised how easy it is.

HTH. YMMV.
Martin Blume