Hi there,

Just a quick question from a DSP beginner. I'm doing a college project on basic speech recognition and have some questions. I've been told to use a 256-point FFT with a Hanning window and 50% overlap. This produces a number of windowed frames, right? (For example, a 512-sample sound file gives 3 frames.) I then have to perform an FFT on each frame and apply a Dynamic Time Warping algorithm (Sakoe & Chiba) to the results.

My problem is: how do I turn my 3 feature vectors into 1? Do I simply average them? Or sum them? Or have I missed something very important? Do I actually need a single feature vector? What is a feature vector, exactly? Oh dear, oh dear... I hope that made sense to someone!

Many thanks,
Joe
joewoodhouse30@hotmail.com
FFT and windowing for a newbie
Started by ●January 3, 2005
Reply by ●January 4, 2005
Joe,

Are you planning to use the FFT components directly as the feature vector? Typically you'd smooth the spectrum in some manner rather than use the spectrum coefficients directly. You can smooth the spectrum by applying a low-pass filter to the spectrum coefficients, which amounts to multiplying the "quefrency"-domain coefficients by the low-pass response. This is called "liftering". Usually the liftering is performed on the cepstrum coefficients rather than on the time-domain samples directly.

If s represents your speech, u the excitation, and h the filter coefficients of your source model, then (with * denoting convolution):

$$ s = u * h $$
$$ \mathcal{F}\{s\} = \mathcal{F}\{u\}\,\mathcal{F}\{h\} $$
$$ \log \mathcal{F}\{s\} = \log \mathcal{F}\{u\} + \log \mathcal{F}\{h\} $$
$$ \mathrm{Cepstrum}\{s\} = \mathcal{F}^{-1}\{\log \mathcal{F}\{s\}\} = \mathcal{F}^{-1}\{\log \mathcal{F}\{u\}\} + \mathcal{F}^{-1}\{\log \mathcal{F}\{h\}\} $$

So by liftering the cepstrum you shut out the high-quefrency excitation components and retain the components that cause the slower envelope variations, which are more suitable for recognition.

You might also consider estimating LPC parameters and using a concatenated set of LPC + cepstrum coefficients as your vector for DTW.

Regards,
Ravi
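As a rough illustration of the liftering idea, here is a minimal NumPy sketch. It is only one way to do it: the function name `liftered_log_spectrum` and the cutoff `n_keep=20` are made up for the example, and it uses the log magnitude (the real cepstrum) rather than the complex logarithm written in the derivation.

```python
import numpy as np

def liftered_log_spectrum(frame, n_keep=20):
    """Return (low-quefrency cepstrum, smoothed log-magnitude spectrum).

    frame  : one windowed frame of real samples
    n_keep : number of low-quefrency coefficients kept (illustrative value)
    """
    frame = np.asarray(frame, dtype=float)
    spectrum = np.fft.fft(frame)
    # Assumption: log magnitude only; the small offset avoids log(0).
    log_mag = np.log(np.abs(spectrum) + 1e-12)
    cepstrum = np.fft.ifft(log_mag).real

    # Low-pass lifter: keep the first n_keep coefficients and their
    # mirror images so the smoothed log spectrum stays real.
    lifter = np.zeros(len(frame))
    lifter[:n_keep] = 1.0
    if n_keep > 1:
        lifter[-(n_keep - 1):] = 1.0

    smoothed = np.fft.fft(cepstrum * lifter).real
    return cepstrum[:n_keep], smoothed
```

The first return value could serve as a per-frame feature vector; the second is the smoothed log spectrum, handy for plotting against the raw one.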
Reply by ●January 4, 2005
> my problem is how do i turn my 3 feature vectors into 1? do i simply
> average them? or sum them?

Forgot to answer the averaging question: if you're using the Hanning window with 50% overlap, you can simply sum the overlapping cepstra. This is because the window has the sum-to-constant property at that overlap, so the coefficients are properly weighted.
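The sum-to-constant property is easy to check numerically. The sketch below assumes a periodic Hann window (written out explicitly) and a hop of exactly half the window length; the symmetric variant returned by `np.hanning` sums only approximately to a constant. The frame count of 6 is arbitrary.

```python
import numpy as np

n = 256          # window length
hop = n // 2     # 50% overlap
# Periodic Hann window: w[k] + w[k + n/2] == 1 for every k.
window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n) / n)

n_frames = 6     # arbitrary number of frames for the check
total = np.zeros((n_frames - 1) * hop + n)
for i in range(n_frames):
    total[i * hop : i * hop + n] += window

# Skip the first and last half-window, where only one frame contributes.
interior = total[hop:-hop]
print(interior.min(), interior.max())
```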
Reply by ●January 4, 2005
Thanks for the help, Ravi!

I've been told to just use the amplitudes of frequency bands as my feature vectors. The cepstrum coefficients etc. come in later. At the moment I have got something working, but it gives extremely poor results (25/160 correct classifications... which is less than pure chance!), but from what you said this could be expected.
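For reference, a minimal sketch of this kind of front end: 256-sample Hann-windowed frames with 50% overlap, an FFT per frame, then amplitudes averaged into bands. The function names, the 16 equal-width bands and the zero-padding of the tail are assumptions for illustration, not part of the assignment as described.

```python
import numpy as np

def frame_spectra(x, n_fft=256, hop=128):
    """Magnitude spectra of Hann-windowed frames with 50% overlap."""
    x = np.asarray(x, dtype=float)
    # Number of frames needed to cover the signal, at least one.
    n_frames = max(1, int(np.ceil((len(x) - n_fft) / hop)) + 1)
    # Assumption: zero-pad the tail so the last frame is complete.
    padded_len = (n_frames - 1) * hop + n_fft
    x = np.pad(x, (0, padded_len - len(x)))
    window = np.hanning(n_fft)
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # One row per frame, n_fft // 2 + 1 frequency bins per row.
    return np.abs(np.fft.rfft(frames, axis=1))

def band_amplitudes(spectra, n_bands=16):
    """Average the magnitudes into n_bands roughly equal-width bands."""
    bands = np.array_split(spectra, n_bands, axis=1)
    return np.stack([b.mean(axis=1) for b in bands], axis=1)

# A 512-sample signal gives 3 frames, i.e. a sequence of 3 feature vectors.
signal = np.sin(2 * np.pi * 0.05 * np.arange(512))
features = band_amplitudes(frame_spectra(signal))
print(features.shape)
```

Each row is one feature vector, so an utterance becomes a sequence of vectors rather than a single vector.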
Reply by ●January 4, 2005
goatman wrote:
> Thanks for the help ravi!
> I am told to just use the amplitudes of frequency bands as my feature
> vectors. The cepstrum coefficients etc come in later. At the moment I
> have got something working, but it gives extremely poor results (25/160
> correct classifications...which is less than pure chance!) but from
> what you said this could be expected

Goatman, if you invert the output of your detector, you have a 135/160 detection rate, which is quite good!
Reply by ●January 5, 2005
Don't forget to take care of the ends of your sample set. You should have a half-window at either end for proper weighting - otherwise you'd be incorrectly de-emphasising samples at both ends. - Ravi
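To see the effect at the ends, one can add up the window weights that each sample receives. A sketch, assuming a periodic Hann window, 50% overlap, a signal length that is a multiple of the hop, and a hypothetical helper `overlap_weights`:

```python
import numpy as np

def overlap_weights(n_samples, n=256, pad_ends=True):
    """Total window weight applied to each original sample."""
    hop = n // 2
    window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n) / n)
    # Optionally extend the signal by half a window at each end.
    offset = hop if pad_ends else 0
    padded_len = n_samples + 2 * offset
    weights = np.zeros(padded_len)
    for start in range(0, padded_len - n + 1, hop):
        weights[start : start + n] += window
    # Return only the part that corresponds to the original samples.
    return weights[offset : offset + n_samples]

with_pad = overlap_weights(512, pad_ends=True)
without_pad = overlap_weights(512, pad_ends=False)
print(with_pad.min(), without_pad.min())
```

With the extra half-window at each end every sample gets the same total weight; without it, the first and last 128 samples are tapered towards zero.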
Reply by ●January 5, 2005
> Goatman, if you invert the output of your detector, you have a 135/160
> detection rate, which is quite good!

Ah! But this isn't a binary hypothesis test. It's probably more like an M-ary hypothesis test, so that won't work!
Reply by ●January 7, 2005
In the code assignment we are told simply to extend the last window with the necessary number of zeros, so we have no half-windows at all. Will that mean poor performance?

At the moment I have the basic Sakoe and Chiba algorithm up and running (I think) properly, and I get an accuracy of about 28%. That seems rather low to me, although I suppose it is better than pure chance.
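For comparison, a minimal DTW sketch with a band constraint. DTW compares two sequences of frame vectors directly, so the frames do not need to be merged into a single vector. This is a plain unit-weight recursion with a band around the diagonal, in the spirit of Sakoe and Chiba but not the exact step pattern or normalisation from their paper; the function name and the `band` parameter are made up for the example.

```python
import numpy as np

def dtw_distance(a, b, band=None):
    """DTW distance between feature sequences a (n, d) and b (m, d).

    band : maximum allowed |i - j| between aligned frames
           (None means no constraint).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    if band is None:
        band = max(n, m)
    # The band must be wide enough for the path to reach the end point.
    band = max(band, abs(n - m))

    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - band), min(m, i + band) + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # Euclidean frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # a advances, b repeats
                                 cost[i, j - 1],      # b advances, a repeats
                                 cost[i - 1, j - 1])  # both advance
    return cost[n, m]

# Two one-dimensional "utterances"; the second repeats its first frame.
x = np.array([[0.0], [1.0], [2.0]])
y = np.array([[0.0], [0.0], [1.0], [2.0]])
print(dtw_distance(x, y, band=1))
```

A test utterance would then be assigned to the template with the smallest distance.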