comp.dsp | Vowel recognizer using FFTW

Hi! I must build a vowel recognizer using the FFTW library:
analyzing a .wav file, I must retrieve the fundamental and the harmonics,
then compare these with the fundamental and harmonics of other .wav files
previously archived, to choose the closest vowel.

I followed these steps:
- first, I load the samples from the .wav file into an array of
fftw_complex, using 0.0 as the imaginary parts;
- then, I perform a c2c DFT using FFTW_ESTIMATE as the flag; the length of
the DFT is the number of samples (say NS) in the .wav file (in general, this
number ISN'T a power of 2);
- last, I get an array of fftw_complex; the length of the array is NS.

Now I must retrieve the fundamental and the harmonics from this array.
How can I interpret the values of the array? I've read the FFTW manual,
but the problem is still unresolved.

thanks in advance
gianluca

PS: I apologize for my bad English, I'm Italian...

Reply by Rune Allnor ● March 22, 2006

acid_burn@inwind wrote:
> Hi! I must build a vowel recognizer using the FFTW library:
> analyzing a .wav file, I must retrieve the fundamental and the harmonics,
> then compare these with the fundamental and harmonics of other .wav files
> previously archived, to choose the closest vowel.
>
> I followed these steps:
> - first, I load the samples from the .wav file into an array of
> fftw_complex, using 0.0 as the imaginary parts;
> - then, I perform a c2c DFT using FFTW_ESTIMATE as the flag; the length of
> the DFT is the number of samples (say NS) in the .wav file (in general, this
> number ISN'T a power of 2);
> - last, I get an array of fftw_complex; the length of the array is NS.
>
> Now I must retrieve the fundamental and the harmonics from this array.
> How can I interpret the values of the array? I've read the FFTW manual,
> but the problem is still unresolved.

Hmmm...

I don't have much experience with speech processing, but it seems your
general approach is a bit unconventional. It seems to me that most
people use some sort of LPC approach.

Anyway, let's have a look at your problem:

First, the problem consists of two parts: pitch and vowel. It is not
reasonable to expect that a given vowel at a high pitch should compare
well with the same vowel at a lower pitch, so the first task would be to
normalize the signal spectrum. One could, for instance, use some sort of
AM scheme to modulate the detected pitch to some normalized reference pitch.

Once that is done, you might try to compare the normalized spectrum with
the different (normalized) reference spectra, and see what fits best.
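
The "see what fits best" step could be sketched roughly as below. This is
only an illustration in Python, not part of FFTW: it assumes the spectra
are already pitch-normalized magnitude arrays of equal length, and the
function names, the Euclidean distance measure and the reference data are
all made up for the example. The normalize() helper here only removes
overall loudness; it does not do the pitch normalization described above.

```python
import math

def normalize(spectrum):
    """Scale a magnitude spectrum to unit Euclidean length (removes loudness)."""
    norm = math.sqrt(sum(x * x for x in spectrum))
    return [x / norm for x in spectrum] if norm > 0 else list(spectrum)

def closest_reference(spectrum, references):
    """Return the label of the reference spectrum nearest to `spectrum`."""
    s = normalize(spectrum)
    best_label, best_dist = None, float("inf")
    for label, ref in references.items():
        r = normalize(ref)
        # Euclidean distance between the two level-normalized spectra
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(s, r)))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Made-up magnitude spectra, for illustration only.
references = {"a": [1.0, 4.0, 2.0, 0.5], "i": [3.0, 0.5, 0.5, 2.0]}
print(closest_reference([2.1, 7.9, 4.2, 1.0], references))
```

Other distance measures would work the same way; Euclidean distance is
simply the easiest one to write down.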

> PS: I apologize for my bad English, I'm Italian...

Don't worry. Your English is WAY better than my Italian... I spent three
months in Italy a few years ago. At the end, my Italian was *just*
sufficient for me to make my way in and out of a restaurant...

Rune

Reply by acid...@inwind ● March 22, 2006

Hi!

>the signal spectrum. One could, for instance, use some sort of AM scheme
>to modulate the detected pitch to some normalized reference pitch.

What does "AM scheme" mean? Is there any library to perform the
normalization with this method? Can I do this with FFTW?

The FFTW manual says:
"Note also that we use the standard "in-order" output ordering--the k-th
output corresponds to the frequency k/n (or k/T, where T is your total
sampling period)."

What does it mean?

Take an example: I have an array like this:

 0  1  2  3  4  5  <-- Indexes (n=6)
10 12  7 15  1  8  <-- Data

The array is just an example and it isn't symmetric. Now, excluding a[0]
(DC amplitude) and a[3] (Nyquist amplitude), is 12 the level of the
frequency 1/6? Is 7 the level of the frequency 2/6? Is 1 the level of the
frequency 4/6? Is that right? How do I find the fundamental and the
harmonics from this array? What are the DC amplitude and the Nyquist
amplitude?

thanks in advance
gianluca

Reply by Rune Allnor ● March 22, 2006

acid_burn@inwind wrote:
> Hi!
>
> >the signal spectrum. One could, for instance, use some sort of AM scheme
> >to modulate the detected pitch to some normalized reference pitch.
>
> What does "AM scheme" mean? Is there any library to perform the
> normalization with this method? Can I do this with FFTW?

"AM" means "amplitude modulator."

> The FFTW manual says:
> "Note also that we use the standard "in-order" output ordering--the k-th
> output corresponds to the frequency k/n (or k/T, where T is your total
> sampling period)."
>
> What does it mean?
>
> Take an example: I have an array like this:
>
>  0  1  2  3  4  5  <-- Indexes (n=6)
> 10 12  7 15  1  8  <-- Data
>
> The array is just an example and it isn't symmetric. Now, excluding a[0]
> (DC amplitude) and a[3] (Nyquist amplitude), is 12 the level of the
> frequency 1/6? Is 7 the level of the frequency 2/6? Is 1 the level of the
> frequency 4/6? Is that right? How do I find the fundamental and the
> harmonics from this array? What are the DC amplitude and the Nyquist
> amplitude?

It seems you could benefit from reading a text on DSP. Try

Lyons: Understanding Digital Signal Processing
    Prentice-Hall, 2004.

It will answer most of the questions in your post.
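
As a quick illustration of the indexing question, here is a minimal
sketch in Python. It assumes a sample rate fs of 8000 Hz (a made-up
value), and the naive dft() below only stands in for FFTW's forward c2c
transform, using the same sign convention. The frequency k/n in the
manual is in cycles per sample; multiplying by fs gives Hz. The helper
names are hypothetical.

```python
import cmath
import math

def dft(samples):
    """Naive O(n^2) forward DFT, standing in for FFTW's c2c transform."""
    n = len(samples)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(samples))
            for k in range(n)]

def bin_frequency(k, n, fs):
    """Frequency in Hz of output bin k of an n-point DFT at sample rate fs."""
    # Bins above n/2 represent the negative frequencies (k - n) * fs / n.
    return k * fs / n if k <= n // 2 else (k - n) * fs / n

fs = 8000.0  # assumed sample rate in Hz
n = 8
# Real cosine with exactly one cycle in the 8 samples, i.e. fs/n = 1000 Hz.
samples = [math.cos(2 * math.pi * t / n) for t in range(n)]
spectrum = dft(samples)
for k, value in enumerate(spectrum):
    print(k, bin_frequency(k, n, fs), round(abs(value), 3))
```

Bin 0 is the DC term (the sum of all samples, zero frequency) and bin n/2
is the Nyquist term (fs/2). For real input, bins above n/2 mirror the
bins below it, so in the 6-point example only indexes 0 to 3 carry
independent information, and the magnitudes at 4 and 5 would equal those
at 2 and 1.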

Rune

Reply by Richard Dobson ● March 22, 2006

Rune Allnor wrote:
> acid_burn@inwind wrote:
>> Hi! I must build a vowel recognizer using the FFTW library:
>> analyzing a .wav file, I must retrieve the fundamental and the harmonics,
>> then compare these with the fundamental and harmonics of other .wav files
>> previously archived, to choose the closest vowel.
>..
> First, the problem consists of two parts: pitch and vowel. It is not
> reasonable to expect that a given vowel at a high pitch should compare
> well with the same vowel at a lower pitch, so the first task would be to
> normalize the signal spectrum. One could, for instance, use some sort of
> AM scheme to modulate the detected pitch to some normalized reference pitch.

This sounds unnecessary. The primary task in vowel recognition is to 
extract and identify the vocal formants which in turn make up the 
spectral envelope, all of which is independent of pitch. For an 
illustration see e.g.:

http://hyperphysics.phy-astr.gsu.edu/hbase/music/vowel.html

Extracting a spectral envelope is in effect a low-pass filtering process
on a frame of FFT amplitudes (I am used to thinking in terms of the
phase vocoder, so these are the amplitudes calculated with "hypot()"
from the raw complex output of the FFT), to find the overall shape of
the spectrum, and indeed to ignore small-scale deviations representing
individual partials.
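
A minimal sketch of that idea in Python, under stated assumptions: the
complex bins below are made up, and a centered moving average is used as
the low-pass filter purely because it is the simplest choice (the filter
and its width are not specified in this thread). For real input only
bins 0 to n/2 need to be processed.

```python
import math

def magnitudes(spectrum):
    """Amplitude of each complex bin, computed as hypot(real, imag)."""
    return [math.hypot(c.real, c.imag) for c in spectrum]

def smooth(values, width):
    """Centered moving average: a crude low-pass along the frequency axis."""
    half = width // 2
    out = []
    for i in range(len(values)):
        # Clip the averaging window at both ends of the array.
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

# Made-up complex bins with sharp partials at every other bin.
spectrum = [complex(3, 4), 0j, complex(6, 8), 0j,
            complex(3, 4), 0j, complex(0, 1), 0j]
amps = magnitudes(spectrum)
envelope = smooth(amps, 3)
print(amps)
print(envelope)
```

A wider window gives a smoother envelope at the cost of blurring the
formant peaks, so the width would have to be tuned against real data.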

Many vowels are diphthongs, and (for speech especially) are
characterised by pitch rises or falls, so one does need to extract the
pitch trajectory from the sound as well to identify these. Finding the
fundamental is sufficient; but one may prefer to derive this from
detected harmonics as FFT resolution is typically better "up there".
This in turn implies that one needs to detect the actual (or relative)
pitch of a vowel combination, and not to normalise everything to a
single reference pitch. In any case, the database of vowel formant
frequencies is independent of the spoken/sung pitch.
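
Deriving the fundamental from a detected harmonic could look roughly
like the Python sketch below. It rests on loud assumptions: the lowest
detected peak is taken to be the fundamental (which may fail when the
fundamental is weak or missing), the threshold and bin width are made-up
values, and the function names are hypothetical.

```python
def find_peaks(amps, threshold):
    """Indices of local maxima in `amps` that exceed `threshold`."""
    return [k for k in range(1, len(amps) - 1)
            if amps[k] > threshold
            and amps[k] > amps[k - 1] and amps[k] >= amps[k + 1]]

def estimate_fundamental(amps, bin_width_hz, threshold):
    """Rough f0 from the lowest peak, refined using the highest harmonic."""
    peaks = find_peaks(amps, threshold)
    if not peaks:
        return None
    rough = peaks[0] * bin_width_hz   # assumed to be the fundamental
    top = peaks[-1] * bin_width_hz    # highest detected harmonic
    harmonic_number = round(top / rough)
    # The bin quantization error is divided by the harmonic number.
    return top / harmonic_number

# Made-up amplitudes: a 123 Hz voice seen through 10 Hz wide bins, so its
# harmonics (123, 246, 369, 492 Hz) land in bins 12, 25, 37 and 49.
amps = [0.0] * 64
for k, level in ((12, 1.0), (25, 0.8), (37, 0.6), (49, 0.4)):
    amps[k] = level
print(estimate_fundamental(amps, 10.0, 0.1))
```

Here the lowest peak alone would give 120 Hz, while dividing the fourth
harmonic's frequency by 4 gives a value closer to 123 Hz.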

Richard Dobson.

Reply by acid...@inwind ● March 22, 2006

Hi!

>It seems you could benefit from reading a text on DSP. Try

That's right, I know I have gaps in DSP theory, but I have no time to
read an entire book...

I simply need to understand how to interpret FFTW's output array and
how to extract the fundamental and the harmonics from it.

I've seen that, transforming all the samples in one step, I get as output
a wave that has all its frequencies near the 0 frequency. Maybe I must
extract a small subset of pitches from the recorded wave?

thanks in advance

gianluca

Reply by acid...@inwind ● March 22, 2006

Hi! 

>Extracting a spectral envelope is in effect a low-pass filtering process
>on a frame of FFT amplitudes (I am used to thinking in terms of the
>phase vocoder, so these are the amplitudes calculated with "hypot()"
>from the raw complex output of the FFT), to find the overall shape of
>the spectrum, and indeed to ignore small-scale deviations representing
>individual partials.
>
>Many vowels are diphthongs, and (for speech especially) are
>characterised by pitch rises or falls, so one does need to extract the
>pitch trajectory from the sound as well to identify these. Finding the
>fundamental is sufficient; but one may prefer to derive this from
>detected harmonics as FFT resolution is typically better "up there".
>This in turn implies that one needs to detect the actual (or relative)
>pitch of a vowel combination, and not to normalise everything to a
>single reference pitch. In any case, the database of vowel formant
>frequencies is independent of the spoken/sung pitch.

OK, but how do I perform this with FFTW? Can you post some simple
pseudo-code to do that?

thanks in advance

gianluca

Reply by Richard Dobson ● March 22, 2006

acid_burn@inwind wrote:
..
> 
> OK, but how do I perform this with FFTW? Can you post some simple
> pseudo-code to do that?
> 

Doing the FFT is just the first stage. Posting pseudo-code that would be 
of any use is more than I can take on right now. I suggest you look at 
the CLAM sources:

http://www.iua.upf.es/mtg/clam/

This has loads of C++ code (using FFTW, but possibly still v2) for 
extracting spectral envelopes, finding peaks, pitch extraction, etc. You 
may find CLAM of interest anyway, it is a widely used library of classes 
for sound analysis and processing, with some very cool GUI tools as well.

Richard Dobson

Reply by Richard Owlett ● March 23, 2006

Richard Dobson wrote:

> In any case, the database of vowel formant
> frequencies is independent of the spoken/sung pitch.

Can you suggest a page that discusses that?

I explored
http://hyperphysics.phy-astr.gsu.edu/hbase/music/vowel.html .

That site is geared towards pseudo-random exploration.
I'm looking for something more akin to a "guided tour".

Reply by Richard Dobson ● March 23, 2006

Richard Owlett wrote:

> 
>> In any case, the database of vowel formant frequencies is independent
>> of the spoken/sung pitch.
> 
> Can you suggest a page that discusses that?
> 

The best I can find on a quick Google is:

http://www2.sfu.ca/sonic-studio/handbook/Formant.html

You need to look for publications by Johan Sundberg, he did the original 
research on vocal formants, some time ago now. There is relatively 
little of his material directly on the net, most is in books, journals.
If you Google on "formant" + "Sundberg", you should find most of 
whatever is available.

Richard Dobson
