DSPRelated.com
Forums

FFT and windowing for a newbie

Started by goatman January 3, 2005
Hi there,
Just a quick question from a DSP beginner here. I'm doing a
college project on basic speech recognition, and I have some questions.
I am told to use a 256-point FFT with a Hanning window and 50%
overlap. This then produces a number of FFT windows, right? (For
example, in a 512-sample sound file I will have 3 windows.) Then I have
to perform an FFT on them. I then need to apply a Dynamic Time Warping
algorithm (Sakoe & Chiba) to them. My problem is: how do I turn my 3
feature vectors into 1? Do I simply average them? Or sum them? Or have
I missed something very important? Do I actually need a single feature
vector? What is a feature vector exactly? Oh dear oh dear... I hope
that made sense to someone!
Many thanks,
Joe
joewoodhouse30@hotmail.com
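
The framing step described in the question can be sketched as below. This is only an illustration under stated assumptions: the function names `hann` and `frame_signal` are made up for this example, the window is the periodic Hann form, and the input is a synthetic tone rather than speech. It reproduces the count from the post: a 512-sample signal with 256-sample frames and a hop of 128 gives 3 frames.

```python
import math

def hann(n_points):
    """Periodic Hann window: w[n] = 0.5 - 0.5*cos(2*pi*n/N)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_points)
            for n in range(n_points)]

def frame_signal(samples, frame_len=256, hop=128):
    """Split samples into Hann-windowed frames; hop = frame_len/2 is 50% overlap.

    Samples past the last complete frame are simply dropped in this sketch.
    """
    window = hann(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = samples[start:start + frame_len]
        frames.append([x * w for x, w in zip(chunk, window)])
    return frames

# A 512-sample test tone (hypothetical input, not real speech).
signal = [math.sin(2.0 * math.pi * 10 * n / 256) for n in range(512)]
frames = frame_signal(signal)
print(len(frames), len(frames[0]))  # 3 frames of 256 samples each
```

Each frame then goes through its own FFT, so an utterance ends up as a time-ordered sequence of spectra, one per frame.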

Joe,

Are you planning on using the FFT components directly as the feature 
vector? Typically, you'd smooth the spectrum in some manner rather than 
using the spectrum coefficients directly.

You could smooth the spectrum by applying a low-pass filter to the 
spectrum coefficients, which is equivalent to multiplying the 
"quefrency" domain coefficients by the low-pass response. This is 
called "liftering".

Also, usually this liftering is performed on the "cepstrum coefficients" 
rather than on the time domain samples directly. If s represents your 
speech, u the excitation, and h the filter coefficients of your source 
model, then:

s = u * h            (* denotes convolution)
F{s} = F{u} F{h}
log F{s} = log F{u} + log F{h}
Cepstrum{s} = Finv{log F{s}} = Finv{log F{u}} + Finv{log F{h}}

So by liftering the cepstrum you'll be shutting out the high-quefrency 
excitation components and retaining the components that cause the slower 
envelope variations that are more suitable for recognition. You might 
also consider estimating LPC parameters and using a concatenated set of 
LPC+Cepstrum coefficients as your vector for DTW.

Regards,
Ravi
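
A minimal sketch of the cepstrum and liftering steps described above, using only the Python standard library. Everything named here (`dft`, `real_cepstrum`, `lifter`, `smoothed_log_spectrum`, the cutoff `n_keep=13`, the test frame) is an assumption made for illustration. It uses the real cepstrum, i.e. the log of the magnitude spectrum, and a naive O(N^2) DFT that is only sensible for short frames.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (fine for short frames)."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out

def real_cepstrum(frame, floor=1e-12):
    """Finv{ log |F{s}| } for one frame; floor avoids log(0)."""
    log_mag = [math.log(max(abs(c), floor)) for c in dft(frame)]
    return [c.real for c in dft(log_mag, inverse=True)]

def lifter(cepstrum, n_keep):
    """Low-time lifter: keep quefrencies below n_keep and their mirror images."""
    n = len(cepstrum)
    return [c if (q < n_keep or q > n - n_keep) else 0.0
            for q, c in enumerate(cepstrum)]

def smoothed_log_spectrum(frame, n_keep=13):
    """Log-magnitude spectrum rebuilt from the liftered cepstrum."""
    liftered = lifter(real_cepstrum(frame), n_keep)
    return [c.real for c in dft(liftered)]

# Hypothetical 64-sample test frame, not real speech.
frame = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(64)]
cep = real_cepstrum(frame)
envelope = smoothed_log_spectrum(frame)
print(len(cep), len(envelope))
```

The low-quefrency coefficients that survive the lifter (here `cep[0]` to `cep[12]`) are the kind of values that could serve as one frame's feature vector.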

goatman wrote:
> my problem is how do i turn my 3 feature vectors into 1? do i simply average them? or sum them?
Forgot to answer the averaging question: if you're using the Hanning window with 50% overlap, you can simply sum the overlapping cepstrums. This is because the window has the sum-to-constant property, so the coefficients are properly weighted.
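
The sum-to-constant property can be checked directly for a Hann window at exactly 50% overlap. The short derivation below assumes the periodic form of the window with length N (the symbols w and N are introduced here for illustration); with the symmetric form, which divides by N-1, the sum is only approximately constant.

```latex
\begin{aligned}
w[n] &= \tfrac{1}{2}\left(1 - \cos\frac{2\pi n}{N}\right), \qquad 0 \le n < N \\
w[n] + w\!\left[n + \tfrac{N}{2}\right]
  &= \tfrac{1}{2}\left(1 - \cos\frac{2\pi n}{N}\right)
   + \tfrac{1}{2}\left(1 - \cos\!\left(\frac{2\pi n}{N} + \pi\right)\right) \\
  &= 1, \qquad 0 \le n < \tfrac{N}{2}
\end{aligned}
```

The last step uses cos(x + pi) = -cos(x), so the two cosine terms cancel and every sample covered by two windows gets a total weight of 1.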
Thanks for the help, Ravi!
I am told to just use the amplitudes of frequency bands as my feature
vectors. The cepstrum coefficients etc. come in later. At the moment I
have got something working, but it gives extremely poor results (25/160
correct classifications... which is less than pure chance!), but from
what you said this could be expected.
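
One possible reading of "amplitudes of frequency bands" is sketched below. The thread does not say how the bands are laid out, so the 16 equal-width bands, the function names, and the test tone are all assumptions made for illustration. Bins 0 to 127 of a 256-point transform are used, and the Nyquist bin is left out for simplicity.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0 .. N/2-1 (naive O(N^2) DFT, fine for N = 256)."""
    n = len(frame)
    mags = []
    for k in range(n // 2):
        acc = 0j
        for t, x in enumerate(frame):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        mags.append(abs(acc))
    return mags

def band_amplitudes(magnitudes, n_bands=16):
    """Average bin magnitudes into n_bands equal-width bands.

    Assumes n_bands <= len(magnitudes) so that no band is empty.
    """
    n_bins = len(magnitudes)
    features = []
    for b in range(n_bands):
        lo = b * n_bins // n_bands
        hi = (b + 1) * n_bins // n_bands
        features.append(sum(magnitudes[lo:hi]) / (hi - lo))
    return features

# Hypothetical unwindowed test tone sitting exactly on bin 10.
tone = [math.sin(2.0 * math.pi * 10 * t / 256) for t in range(256)]
features = band_amplitudes(magnitude_spectrum(tone))
print(len(features))  # one 16-element feature vector for this frame
```

Here all the energy lands in the second band (bins 8 to 15), which is the sort of coarse spectral shape a band-amplitude feature vector captures.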

Goatman, if you invert the output of your detector, you have a 135/160 detection rate, which is quite good!
Don't forget to take care of the ends of your sample set. You should 
have a half-window at either end for proper weighting - otherwise you'd 
be incorrectly de-emphasising samples at both ends.

- Ravi
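
The end effect can be checked numerically. The sketch below is an illustration under assumptions (periodic Hann window, 256-sample frames, hop of 128, a hypothetical helper named `coverage`): it adds up the window weight that lands on each original sample, first without padding and then with half a window of zeros added at each end.

```python
import math

def hann(n_points):
    """Periodic Hann window."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_points)
            for n in range(n_points)]

def coverage(n_samples, frame_len=256, hop=128, pad=0):
    """Summed Hann weight on each original sample, with pad zeros at both ends."""
    window = hann(frame_len)
    total = n_samples + 2 * pad
    weight = [0.0] * total
    for start in range(0, total - frame_len + 1, hop):
        for n, w in enumerate(window):
            weight[start + n] += w
    return weight[pad:pad + n_samples]  # drop the padded positions again

unpadded = coverage(512)
padded = coverage(512, pad=128)
print(unpadded[0], unpadded[64])  # edge samples get less than full weight
print(min(padded), max(padded))   # both very close to 1.0
```

Without padding, the first and last 128 samples are covered by a single window and are tapered towards zero. With the half-window padding, every original sample receives a total weight of 1.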
> Goatman, if you invert the output of your detector, you have a 135/160 detection rate, which is quite good!
Ah! But this isn't a binary hypothesis test. It's probably more like an M-ary hypothesis test. So that won't work!
In the code assignment we are told to simply extend the last window by
the necessary number of zeros, so we have no half windows at all. Will
that mean poor performance? At the moment I have the basic Sakoe and
Chiba up and running (I think) properly, and I get accuracy of about
28%. That seems rather low to me! Although I suppose it is better than
pure chance.
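
On the original question of merging the per-frame vectors into one: DTW works on whole sequences of feature vectors, so the frames can stay separate and in time order. The sketch below is a minimal illustration, not the assignment's method. It assumes the simplest step pattern (unit weights on all three moves), which may differ from the recurrence in Sakoe and Chiba's formulation. The `band` parameter is a Sakoe-Chiba style limit on |i - j|, and all names and the toy templates are hypothetical.

```python
import math

def euclidean(a, b):
    """Distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(seq_a, seq_b, band=None):
    """DTW cost between two sequences of feature vectors.

    band: largest allowed |i - j| along the warping path; None means no limit.
    """
    n, m = len(seq_a), len(seq_b)
    if band is None:
        band = max(n, m)
    band = max(band, abs(n - m))  # the end point must stay reachable
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - band), min(m, i + band) + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # advance in seq_a only
                                 cost[i][j - 1],      # advance in seq_b only
                                 cost[i - 1][j - 1])  # advance in both
    return cost[n][m]

def classify(unknown, templates, band=None):
    """Return the label of the template with the smallest DTW cost."""
    return min(templates,
               key=lambda label: dtw_distance(unknown, templates[label], band))

# Toy one-dimensional "feature vectors" (hypothetical data).
templates = {"a": [[0.0], [1.0], [2.0]], "b": [[5.0], [5.0], [5.0]]}
unknown = [[0.0], [0.0], [1.0], [2.0]]
print(classify(unknown, templates, band=1))  # prints a
```

The unknown sequence is one frame longer than template "a", yet the warping path absorbs the repeated first frame at zero cost, which is the kind of timing difference DTW is meant to handle.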