
As "Nyquist" is to "sample rate" "????" is to "sample period/duration/width/?" ?

Started by Richard Owlett September 19, 2004
I'm interested in speech signals as input to speech recognition software.

I get the impression that minimum acceptable sample rates begin at 8 kHz
(or above). I assume this is based on which formants are considered
"significant". I have somewhat arbitrarily chosen 44.1 kHz. The data I
have available is a studio-quality CD.

From another thread, I assume that some characteristic time of a
phoneme is somewhere between 0.01 and 0.1 seconds (+- xx %).

Assuming whatever analysis I do is based on windows of width mm seconds
taken every nn seconds (nn presumed < mm), what are appropriate values
from a DSP point of view?
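
For concreteness, the "window of width mm taken every nn seconds"
framing looks roughly like this in code; the 25 ms window and 10 ms hop
below are placeholder values for illustration, not values recommended
anywhere in this thread:

import numpy as np

fs = 44100           # sample rate of the CD data, Hz
mm = 0.025           # window width in seconds (placeholder value)
nn = 0.010           # hop between window starts in seconds (placeholder)

win_len = int(round(mm * fs))   # samples per window (~1102)
hop = int(round(nn * fs))       # samples between window starts (441)

x = np.random.randn(fs)         # stand-in for one second of speech

# slice the signal into overlapping windows of width mm every nn seconds
frames = [x[i:i + win_len]
          for i in range(0, len(x) - win_len + 1, hop)]
print(len(frames), "windows of", win_len, "samples each")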

[ For perspective, see my previous thread titled 'Low freq "analog" of
Nyquist? (possibly naive question)'. I'm hoping I've learned enough to
phrase my question better. ]

My ultimate goal is to reduce the dependence of speech recognition
accuracy on "good mikes" and a "good acoustic environment". Primarily
the latter.

[ for those of you old enough, "this ram keeps butting the dam" ]
Richard Owlett wrote:

> [original post quoted in full - snipped]
As "Nyquist" is to "sample rate", "frequency resolution" is to "sample set duration". -- Tim Wescott Wescott Design Services http://www.wescottdesign.com
On Sun, 19 Sep 2004 14:53:26 -0500, Richard Owlett
<rowlett@atlascomm.net> wrote:

>[original post quoted in full - snipped]
Hi,

I'm responding to the Subject text; that is, I'm responding to the
words: "Nyquist" is to "sample rate". Please know that I have no clue
whatsoever as to the meaning of that single word "Nyquist". However, I
do have a rough notion of the meaning of the two words "sample rate".

I'm not in the audio business, but here's what I've heard. In
telephones, the microphone signal is filtered so its frequency
bandwidth is just less than 4 kHz. Then that analog signal is digitized
at a sample rate of 8 kHz, which satisfies the "Nyquist Criterion".

Prepare for rant: I don't think people should use the phrase "Nyquist
frequency". That phrase means different things to different people, and
this leads to confusion. I think we should use the phrase "sample rate"
when we mean the "sample rate", and we should use the phrase "half the
sample rate" when we mean "half the sample rate". Simple!!

Back to sampling human speech: As it turns out, for good fidelity a
human voice signal should have a wider bandwidth than 4 kHz. But to
reduce the cost of telephone systems (so they can process as many
simultaneous speech signals as possible), early telephone designers
realized that you could limit a human speech signal to a bandwidth as
low as (roughly) 4 kHz and people (their brains) could still understand
the speech signal.

Audio fanatics know that human hearing goes up to (roughly) 18-20 kHz,
so they want their "high-fidelity" audio systems to cover that full
frequency range. Well, if you have an analog signal whose bandwidth is
20 kHz, then your A/D sample rate must be greater than twice that
frequency (Nyquist Criterion, again), which leads to the "studio
quality" sample rate of 44.1 kHz.

Sorry I can't be of more help. I wouldn't know a "formant", or a
"phoneme", if I found one dead in my lunchbox.

[-Rick-]
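
The arithmetic behind both numbers is just the sampling criterion:
sample rate > 2 x bandwidth. A trivial sketch (the 3.4 kHz figure is
the conventional telephone band edge, an assumption rather than
something stated above):

# Nyquist criterion: the sample rate must exceed twice the signal bandwidth
def min_sample_rate(bandwidth_hz):
    return 2.0 * bandwidth_hz

print(min_sample_rate(3400.0))   # 6.8 kHz needed; telephony uses 8 kHz
print(min_sample_rate(20000.0))  # 40 kHz needed; CD audio uses 44.1 kHz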
"Rick Lyons" <r.lyons@_BOGUS_ieee.org> wrote in message
news:414ed4b6.394648093@news.sf.sbcglobal.net...
> On Sun, 19 Sep 2004 14:53:26 -0500, Richard Owlett
> <rowlett@atlascomm.net> wrote:
>
> Back to sampling human speech: As it turns out,
> for good fidelity a human voice signal should
> have a wider bandwidth than 4 kHz. But to reduce the
> cost of telephone systems (so they can process as many
> simultaneous speech signals as possible) early
> telephone designers realized that you could limit a
> human speech signal to a bandwidth as low as
> (roughly) 4 kHz and people (their brains) could
> still understand the speech signal.
Right. On the phone, it is generally quite easy to understand normal
conversational speech even with the limited frequency response.
However, if someone tries to read a string of random letters, it is
quite a bit more difficult to understand them on the other end. Losing
those high frequencies makes consonants difficult to differentiate.

The brain normally does a good job of compensating for the loss of high
frequencies by using context clues. But since very few context clues
exist within a string of random letters, it becomes difficult to
understand.

So saying that a 4 kHz bandwidth is adequate for speech is a bit
misleading. Consonant sounds have some frequency content up to close to
20 kHz, though there is limited benefit to increasing to anything more
than 10 kHz IMO.
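
One way to hear the effect Jon describes is to low-pass filter the same
recording at different cutoffs and compare how well the consonants
survive. A rough scipy sketch; the file name and cutoff values are
placeholders, and it assumes a recording sampled at 44.1 kHz or similar:

import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

fs, x = wavfile.read("speech.wav")   # placeholder file name
x = x.astype(np.float64)

for cutoff in (4000.0, 10000.0):     # telephone-like vs. wider bandwidth
    # 8th-order Butterworth low-pass, applied forward and backward
    sos = butter(8, cutoff, btype="low", fs=fs, output="sos")
    y = sosfiltfilt(sos, x, axis=0)
    wavfile.write(f"speech_lp{int(cutoff)}.wav", fs, y.astype(np.int16))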
"Richard Owlett" schrieb
> As "Nyquist" is to "sample rate" > "????" is to "sample period/duration/width/?" ?
As "sample rate" is "1 / (sample period)" "sample period" is to "1/Nyquist" This may be answer to your question, but not of much help. I think you are mixing up two domains here: the one of strict mathematics and signal processing and the other one - much fuzzier - about the human perception of hearing and the generation of speech. While human hearing is obviously based on the same mathematics and physics of acoustics, there are many tricks that evolution has come up with. You might want to check the "Scientist's and Engineer's Guide to Digital Signal Processing": http://www.analog.com/processors/resources/technicalLibrary/manuals/ training/materials/pdf/dsp_book_frontmat.pdf especially chapter 22, "Audio Processing". HTH Martin
On Mon, 20 Sep 2004 10:30:42 -0700, "Jon Harris"
<goldentully@hotmail.com> wrote:

>"Rick Lyons" <r.lyons@_BOGUS_ieee.org> wrote in message >news:414ed4b6.394648093@news.sf.sbcglobal.net... >> On Sun, 19 Sep 2004 14:53:26 -0500, Richard Owlett >> <rowlett@atlascomm.net> wrote: >> >> Back to sampling human speech: As it turns out, >> for good fidelity a human voice signal should >> have a wider bandwidth than 4 kHz. But to reduce the >> cost of telephone systems (so they can process as many >> simultaneous speech signals as possible) early >> telephone designers realized that you could limit a >> human speech signal to a bandwidth as low as >> (roughly) 4 kHz and people (their brains) could >> still understand the speech signal. > >Right. On the phone, it is generally quite easy to understand normal >conversation speech even with the limited frequency response. However, if >someone tries to read a string of random letters, it is quite a bit more >difficult to understand them on the other end. Losing those high frequencies >makes consonants difficult to differentiate. The brain normally does a good job >of compensating for the loss of high frequencies by using context clues. But >since very few context clues exist with a string of random letters, it becomes >difficult to understand. > >So saying that a 4 kHz bandwidth is adequate for speech is a bit misleading. >Consonant sounds have some frequency content up to close to 20kHz, though there >is limited benefit to increasing to anything more than 10kHz IMO.
Yes yes. You're right! I hadn't thought about the consonants.

That's why, over the phone to say "FFT", we'd say "foxtrot" "foxtrot"
"tango".

[-Rick-]
Rick Lyons wrote:
> That's why, over the phone to say "FFT", > we'd say "foxtrot" "foxtrot" "tango". >
When I was at Raytheon, we had an operator/receptionist who made up her
own phonetic alphabet. She used it to announce license plate numbers
when a driver forgot to turn off the headlights.

She generally made up her phonetic alphabet on the spot as needed. My
favorite was "F as in Fun. L as in Love. And N as in... NEVER!" She
sounded a lot like Aretha Franklin in the Blues Brothers.

One day she paged a license plate by saying "Y as in You." That threw
everyone for a loop, because we all heard it as "Y as in U."

She inspired my coworkers and me to formulate a phonetic alphabet whose
purpose was to obfuscate rather than clarify. We favored the names of
letters, homophones that start with different letters (gnu, knew, new),
names that didn't add information (T as in tea), or words that sound
like they start with a different letter than they really do.

A as in aye
B as in bdellium
C as in cue
D as in Djibouti
E as in eye
F as in Fun (a nod to our operator)
G as in gnu
H as in hour
I as in inn
J as in jalapeno
K as in knew
L as in llama
M as in mnemonic
N as in new
O as in offal
P as in pea
Q as in quay
R as in ... never found a good one for R
S as in sea
T as in tea
U as in ... oops, forgot that one
V as in vee
W as in why
Y as in you
Z as in zee (or zed)

--
Jim Thomas            Principal Applications Engineer      Bittware, Inc
jthomas@bittware.com  http://www.bittware.com     (603) 226-0404 x536
Nothing is ever so bad that it can't get worse. - Calvin
Cute.   I've done similar things, and I like that you overloaded the
"new" and "eye" sounds, which completely defeats the purpose of a
phonetic alphabet.  ;)

Overloading similar sounds works, too, like B = boy and T = toy.   A
low SNR connection creates ambiguities.   So I used to work on rhyming
phonetic alphabets that were similarly useless.

I think you cheated on V and Z, though.


On Tue, 21 Sep 2004 09:32:36 -0400, Jim Thomas <jthomas@bittware.com>
wrote:

>[quoted text snipped]
Eric Jacobsen
Minister of Algorithms, Intel Corp.
My opinions may not be Intel's opinions.
http://www.ericjacobsen.org
One time, I overheard someone spelling something over the phone saying
"C as in cat, M as in mat, and B as in bat". I got a good chuckle out
of that, as did they when I explained how the phonetics chosen didn't
really help much!  :-)

"Eric Jacobsen" <eric.jacobsen@ieee.org> wrote in message
news:415043c5.502019890@news.west.cox.net...
> [quoted text snipped]
"Rick Lyons" <r.lyons@_BOGUS_ieee.org> wrote in message
news:414ffd9c.470654359@news.sf.sbcglobal.net...
> [quoted text snipped]
Exactly! The military phonetic alphabet is designed to minimize ambiguity with a poor quality communication link (unlike the fun ones we've been posting here).