DSPRelated.com

Vowel recognizer using FFTW

Started by acid...@inwind March 21, 2006
Hi! I must build a vowel recognizer using the FFTW library:
analyzing a .wav file, I must retrieve the fundamental and the harmonics,
then compare these with the fundamental and harmonics of other, previously
archived .wav files to choose the closest-matching vowel.

I followed these steps:
- first, I load the samples from the .wav file into an array of
fftw_complex, using 0.0 as the imaginary parts;
- then, I perform a c2c DFT using FFTW_ESTIMATE as the flag; the length of the
DFT is the number of samples (say NS) in the .wav file (in general, this
number ISN'T a power of 2);
- last, I get an array of fftw_complex; the length of the array is NS.

Now I must retrieve the fundamental and the harmonics from this array.
How can I interpret the values of the array? I've read the FFTW manual,
but the problem is still unresolved.

thanks in advance
gianluca

PS: I apologize for my bad English, I'm Italian...


acid_burn@inwind wrote:
> Hi! I must build a vowel recognizer using the FFTW library:
> analyzing a .wav file, I must retrieve the fundamental and the harmonics,
> then compare these with the fundamental and harmonics of other, previously
> archived .wav files to choose the closest-matching vowel.
>
> I followed these steps:
> - first, I load the samples from the .wav file into an array of
> fftw_complex, using 0.0 as the imaginary parts;
> - then, I perform a c2c DFT using FFTW_ESTIMATE as the flag; the length of the
> DFT is the number of samples (say NS) in the .wav file (in general, this
> number ISN'T a power of 2);
> - last, I get an array of fftw_complex; the length of the array is NS.
>
> Now I must retrieve the fundamental and the harmonics from this array.
> How can I interpret the values of the array? I've read the FFTW manual,
> but the problem is still unresolved.
Hmmm... I don't have much experience with speech processing, but your general approach seems a bit unconventional. It seems to me that most people use some sort of LPC approach.

Anyway, let's have a look at your problem. First, the problem consists of two parts: pitch and vowel. It is not reasonable to expect that a given vowel at a high pitch should compare well with the same vowel at a lower pitch, so the first task would be to normalize the signal spectrum. One could, for instance, use some sort of AM scheme to modulate the detected pitch to some normalized reference pitch. Once that is done, you might try to compare the normalized spectrum with the different (normalized) reference spectra, and see what fits best.
> PS: I apologize for my bad english, I'm italian...
Don't worry. Your English is WAY better than my Italian... I spent three months in Italy a few years ago. At the end, my Italian was *just* sufficient for making my way in and out of a restaurant...

Rune
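As a rough illustration of the "see what fits best" step, here is a minimal Python sketch that picks the archived spectrum closest to a probe spectrum. The function names, the Euclidean distance and the unit-length scaling are assumptions chosen for brevity, and the reference values are made-up toy shapes rather than measured vowel data. The sketch only removes loudness differences; it does not perform the pitch normalization discussed above.

```python
import math

def normalize(spectrum):
    """Scale magnitudes to unit Euclidean length so overall loudness drops out."""
    norm = math.sqrt(sum(v * v for v in spectrum))
    return [v / norm for v in spectrum] if norm > 0 else list(spectrum)

def closest_reference(spectrum, references):
    """Return the label of the reference spectrum nearest to `spectrum`."""
    probe = normalize(spectrum)

    def distance(label):
        ref = normalize(references[label])
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(probe, ref)))

    return min(references, key=distance)

# Toy magnitude spectra (hypothetical shapes, NOT real vowel measurements).
references = {
    "a": [1.0, 4.0, 2.0, 1.0],
    "i": [4.0, 1.0, 1.0, 3.0],
}
probe = [2.0, 9.0, 4.0, 2.0]   # roughly the "a" shape, but louder
print(closest_reference(probe, references))
```

All spectra passed in must have the same number of bins, which in practice means using the same frame length and sample rate for the probe and the archive.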
Hi!

> the signal spectrum. One could, for instance, use some sort of AM scheme
> to modulate the detected pitch to some normalized reference pitch.

What does "AM scheme" mean? Is there any library to perform the normalization with this method? Can I do this with FFTW?

The FFTW manual says:
"Note also that we use the standard "in-order" output ordering - the k-th output corresponds to the frequency k/n (or k/T, where T is your total sampling period)."

What does it mean?

Take an example. I have an array like this:

 0  1  2  3  4  5  <-- Indexes (n=6)
10 12  7 15  1  8  <-- Data

The array is just an example and it isn't symmetric. Now, excluding a[0] (DC amplitude) and a[3] (Nyquist amplitude), is 12 the level of the frequency 1/6? Is 7 the level of the frequency 2/6? Is 1 the level of the frequency 4/6? Is that right? How do I find the fundamental and the harmonics from this array? What are the DC amplitude and the Nyquist amplitude?

Thanks in advance
gianluca
acid_burn@inwind wrote:
> Hi!
>
> > the signal spectrum. One could, for instance, use some sort of AM scheme
> > to modulate the detected pitch to some normalized reference pitch.
>
> What does "AM scheme" mean? Is there any library to perform the
> normalization with this method? Can I do this with FFTW?
"AM" means "amplitude modulation."
> The FFTW manual says:
> "Note also that we use the standard "in-order" output ordering - the k-th
> output corresponds to the frequency k/n (or k/T, where T is your total
> sampling period)."
>
> What does it mean?
>
> Take an example. I have an array like this:
>
>  0  1  2  3  4  5  <-- Indexes (n=6)
> 10 12  7 15  1  8  <-- Data
>
> The array is just an example and it isn't symmetric. Now, excluding a[0]
> (DC amplitude) and a[3] (Nyquist amplitude), is 12 the level of the
> frequency 1/6? Is 7 the level of the frequency 2/6? Is 1 the level of the
> frequency 4/6? Is that right? How do I find the fundamental and the
> harmonics from this array? What are the DC amplitude and the Nyquist amplitude?
It seems you could benefit from reading a text on DSP. Try

Lyons: Understanding Digital Signal Processing, Prentice-Hall, 2004.

It will answer most of the questions in this post.

Rune
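The k/n convention and the DC and Nyquist terms asked about above can be illustrated with a small sketch. It is written in Python with a naive DFT standing in for FFTW; the function names, the 6000 Hz sample rate and the test signal are assumptions made for illustration. FFTW's forward transform uses the same unnormalized, negative-exponent convention. Bin 0 is the DC term (zero frequency, the sum of the samples), bin n/2 is the Nyquist term (half the sample rate, present when n is even), and bin k corresponds to k * sample_rate / n Hz. For real input, the bins above n/2 mirror those below it as complex conjugates, so only bins 0..n/2 carry independent information.

```python
import cmath
import math

def naive_dft(x):
    """Unnormalized forward DFT with the exp(-2*pi*i*j*k/n) sign convention.

    A slow stand-in for a forward c2c transform; only meant for tiny inputs.
    """
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * math.pi * j * k / n) for j in range(n))
            for k in range(n)]

def bin_frequency_hz(k, n, sample_rate):
    """Frequency in Hz of output bin k for an n-point DFT.

    Bins above n/2 are reported as negative frequencies.
    """
    if k > n // 2:
        k -= n
    return k * sample_rate / n

# Toy input (an assumption, not data from the thread): one cosine cycle
# across n = 6 real samples, sampled at a hypothetical 6000 Hz.
sample_rate = 6000.0
n = 6
signal = [math.cos(2 * math.pi * j / n) for j in range(n)]
magnitudes = [abs(c) for c in naive_dft(signal)]

for k in range(n):
    print(k, bin_frequency_hz(k, n, sample_rate), round(magnitudes[k], 6))
```

In this toy case only bins 1 and 5 are non-zero, each with magnitude n/2 = 3, and bin 5 is the mirror image of bin 1. An output array that is not symmetric in magnitude, like the 6-element example in the question, could not have come from purely real input.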
Rune Allnor wrote:
> acid_burn@inwind wrote:
>> Hi! I must build a vowel recognizer using the FFTW library:
>> analyzing a .wav file, I must retrieve the fundamental and the harmonics,
>> then compare these with the fundamental and harmonics of other, previously
>> archived .wav files to choose the closest-matching vowel.
> ..
> First, the problem consists of two parts: pitch and vowel. It is not
> reasonable to expect that a given vowel at a high pitch should compare
> well with the same vowel at a lower pitch, so the first task would be to
> normalize the signal spectrum. One could, for instance, use some sort of
> AM scheme to modulate the detected pitch to some normalized reference pitch.
This sounds unnecessary. The primary task in vowel recognition is to extract and identify the vocal formants, which in turn make up the spectral envelope, all of which is independent of pitch. For an illustration see e.g.:

http://hyperphysics.phy-astr.gsu.edu/hbase/music/vowel.html

Extracting a spectral envelope is in effect a low-pass filtering process on a frame of FFT amplitudes (I am used to thinking in terms of the phase vocoder, so these are the amplitudes calculated with "hypot()" from the raw complex output of the FFT), to find the overall shape of the spectrum, and indeed to ignore small-scale deviations representing individual partials.

Many vowels are diphthongs, and (for speech especially) are characterised by pitch rises or falls, so one does need to extract the pitch trajectory from the sound as well to identify these. Finding the fundamental is sufficient; but one may prefer to derive this from detected harmonics as FFT resolution is typically better "up there". This in turn implies that one needs to detect the actual (or relative) pitch of a vowel combination, and not to normalise everything to a single reference pitch. In any case, the database of vowel formant frequencies is independent of the spoken/sung pitch.

Richard Dobson.
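The envelope step described above (amplitudes via hypot(), then low-pass filtering across the bins of one frame) might look roughly like the following Python sketch. The centered moving average is only one simple choice of low-pass filter and is an assumption here, as are the function names and the made-up amplitude values; LPC or cepstral smoothing are common alternatives.

```python
import math

def magnitudes(spectrum):
    """Amplitude hypot(re, im) of each complex bin, for bins 0..n/2 only."""
    half = len(spectrum) // 2 + 1
    return [math.hypot(c.real, c.imag) for c in spectrum[:half]]

def smooth(values, width):
    """Centered moving average across bins; the window shrinks at the edges.

    Acts as a crude low-pass filter along the frequency axis.
    """
    half = width // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

# Made-up amplitudes: sharp partials every 4 bins under a broad hump.
amplitudes = [0, 0, 0, 0, 4, 0, 0, 0, 8, 0, 0, 0, 4, 0, 0, 0, 0]
envelope = smooth(amplitudes, 9)

# The strongest point of the smoothed curve marks the centre of the hump.
peak_bin = max(range(len(envelope)), key=envelope.__getitem__)
print(peak_bin, [round(v, 3) for v in envelope])
```

The smoothing width should span at least one harmonic spacing (4 bins in this toy frame), otherwise the individual partials remain visible in the envelope.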
Hi!

>It seems you could benefit from reading a text on DSP. Try
This is right, I understand my gaps in DSP theory, but I have no time to read an entire book... I simply need to understand how to interpret FFTW's output array and how to extract the fundamental and the harmonics from it.

I've seen that, transforming all the samples in one step, the output has all its frequencies near the 0 frequency. Maybe I must extract a small subset of pitches from the recorded wave?

Thanks in advance
gianluca
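If "a small subset" is read as a short excerpt of the recording, the usual practice is to analyse short, windowed frames rather than the whole file at once. The Python sketch below shows one way to cut such frames; the Hann window, the frame length, the hop size and the function names are all assumptions made for illustration. Each frame would then be passed to the DFT on its own.

```python
import math

def hann(n):
    """Symmetric Hann window of length n (n >= 2); tapers both ends to zero."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def frames(samples, frame_len, hop):
    """Yield successive Hann-windowed frames; a trailing partial frame is dropped."""
    window = hann(frame_len)
    for start in range(0, len(samples) - frame_len + 1, hop):
        yield [samples[start + i] * window[i] for i in range(frame_len)]

def bin_spacing_hz(sample_rate, frame_len):
    """Distance in Hz between adjacent DFT bins for a given frame length."""
    return sample_rate / frame_len

# Hypothetical settings: 1024-sample frames of 44.1 kHz audio.
print(bin_spacing_hz(44100.0, 1024))

# Tiny demonstration on ten dummy samples.
demo = list(frames(list(range(10)), frame_len=4, hop=2))
print(len(demo), demo[0])
```

With a long transform the bins are very closely spaced, and most speech energy sits in a small fraction of the range up to half the sample rate, which may be why a plot of the full output looks crowded near zero.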
Hi! 

> Extracting a spectral envelope is in effect a low-pass filtering process
> on a frame of FFT amplitudes (I am used to thinking in terms of the
> phase vocoder, so these are the amplitudes calculated with "hypot()"
> from the raw complex output of the FFT), to find the overall shape of
> the spectrum, and indeed to ignore small-scale deviations representing
> individual partials.
>
> Many vowels are diphthongs, and (for speech especially) are
> characterised by pitch rises or falls, so one does need to extract the
> pitch trajectory from the sound as well to identify these. Finding the
> fundamental is sufficient; but one may prefer to derive this from
> detected harmonics as FFT resolution is typically better "up there".
> This in turn implies that one needs to detect the actual (or relative)
> pitch of a vowel combination, and not to normalise everything to a
> single reference pitch. In any case, the database of vowel formant
> frequencies is independent of the spoken/sung pitch.
OK, but how do I perform this with FFTW? Can you post some simple pseudo-code to do that?

Thanks in advance
gianluca
acid_burn@inwind wrote:
..
> OK, but how do I perform this with FFTW? Can you post some simple pseudo-code to
> do that?
Doing the FFT is just the first stage. Posting pseudo-code that would be of any use is more than I can take on right now. I suggest you look at the CLAM sources:

http://www.iua.upf.es/mtg/clam/

This has loads of C++ code (using FFTW, but possibly still v2) for extracting spectral envelopes, finding peaks, pitch extraction, etc. You may find CLAM of interest anyway; it is a widely used library of classes for sound analysis and processing, with some very cool GUI tools as well.

Richard Dobson
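One of the later stages mentioned in this thread, deriving the fundamental from a detected harmonic, can be sketched briefly. Bins are evenly spaced in Hz, so a peak location is uncertain by the same amount at every frequency; dividing the frequency of the h-th harmonic by h therefore shrinks that uncertainty by a factor of h. The Python below is a minimal sketch with hypothetical function names and made-up numbers, and it is not taken from CLAM.

```python
def nearest_bin_hz(freq_hz, bin_spacing_hz):
    """Frequency of the DFT bin closest to freq_hz (models peak-picking error)."""
    return round(freq_hz / bin_spacing_hz) * bin_spacing_hz

def refine_f0(coarse_f0_hz, harmonic_peak_hz):
    """Refine a coarse fundamental using the peak of a higher harmonic.

    Assumes coarse_f0_hz is accurate enough that rounding selects the right
    harmonic number; a poor coarse estimate would select the wrong one.
    """
    h = round(harmonic_peak_hz / coarse_f0_hz)
    if h < 1:
        raise ValueError("harmonic peak lies below the coarse fundamental")
    return harmonic_peak_hz / h

# Hypothetical numbers: true fundamental 104 Hz, bins spaced 10 Hz apart.
true_f0 = 104.0
spacing = 10.0
coarse = nearest_bin_hz(true_f0, spacing)        # where the fundamental's peak lands
tenth = nearest_bin_hz(10 * true_f0, spacing)    # where the 10th harmonic's peak lands
refined = refine_f0(coarse, tenth)
print(coarse, refined)
```

In this made-up case the fundamental's own peak is 4 Hz off, while the estimate derived from the tenth harmonic recovers 104 Hz.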
Richard Dobson wrote:

> In any case, the database of vowel formant
> frequencies is independent of the spoken/sung pitch.
Can you suggest a page that discusses that? I explored http://hyperphysics.phy-astr.gsu.edu/hbase/music/vowel.html. That site is geared towards pseudo-random exploration. I'm looking for something more akin to a "guided tour".
Richard Owlett wrote:

>> In any case, the database of vowel formant frequencies is independent
>> of the spoken/sung pitch.
>
> Can you suggest a page that discusses that?
The best I can find on a quick Google is:

http://www2.sfu.ca/sonic-studio/handbook/Formant.html

You need to look for publications by Johan Sundberg; he did the original research on vocal formants, some time ago now. There is relatively little of his material directly on the net; most is in books and journals. If you Google "formant" + "Sundberg", you should find most of whatever is available.

Richard Dobson