comp.dsp | interpreting FFT results

I've been programming a long time and I recently decided to write software
to detect pitch in an audio stream in real time. I have tried a few FFT
implementations (including FFTW), however, I don't really understand how
to interpret the results.

I have a WAV file that plays a continuous tone, middle C (~260hz). I grab
a number of samples (eg 256) and apply a hamming window, then put it
through the FFT. I've tried analyzing the resulting data in many ways, but
I have yet to find that it gives me a frequency of 260hz.

I know this is probably very simple. :) Would anyone care to enlighten
me?

Reply by Tim Wescott ● March 10, 2008

On Mon, 10 Mar 2008 09:13:49 -0500, NateS wrote:

> I've been programming a long time and I recently decided to write
> software to detect pitch in an audio stream in real time. I have tried a
> few FFT implementations (including FFTW), however, I don't really
> understand how to interpret the results.
> 
> I have a WAV file that plays a continuous tone, middle C (~260hz). I
> grab a number of samples (eg 256) and apply a hamming window, then put
> it through the FFT. I've tried analyzing the resulting data in many
> ways, but I have yet to find that it gives me a frequency of 260hz.
> 
> I know this is probably very simple. :) Would anyone care to enlighten
> me?

What frequencies are you getting?

You haven't mentioned sampling rate.  If you're sampling at 22000Hz, then 
a 256 sample FFT is going to give you frequency bins that are 86Hz wide, 
and those are going to be spread by your windowing.
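
As a rough illustration of the bin-width arithmetic above, here is a minimal Python sketch. The function names are made up for this example, and the 22000 Hz rate is only the hypothetical value from the reply, since the original post does not state a sample rate.

```python
def bin_width_hz(sample_rate_hz, fft_length):
    """Spacing between adjacent FFT bins, in Hz."""
    return sample_rate_hz / fft_length


def nearest_bin(freq_hz, sample_rate_hz, fft_length):
    """Index of the bin whose center is closest to freq_hz."""
    return round(freq_hz / bin_width_hz(sample_rate_hz, fft_length))


# Hypothetical setup from the reply: 22000 Hz sampling, 256-point FFT.
width = bin_width_hz(22000, 256)
k = nearest_bin(260, 22000, 256)
print(f"bin width: {width:.1f} Hz")
print(f"260 Hz falls nearest bin {k}, centered at {k * width:.1f} Hz")
```

With bins this wide, a 260 Hz tone sits only about three bins above DC, so the raw peak position alone cannot pin the frequency down to better than tens of Hz.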

Have you done a web search on pitch detection?  I'll admit to not having 
done it myself, but thinking about the problem and reading comments here 
tells me that it isn't trivial.  Hearing pitch is a human perception 
thing, so getting it right with a DSP requires a good knowledge of the 
psychoacoustics of the problem.

-- 
Tim Wescott
Control systems and communications consulting
http://www.wescottdesign.com

Need to learn how to apply control theory in your embedded system?
"Applied Control Theory for Embedded Systems" by Tim Wescott
Elsevier/Newnes, http://www.wescottdesign.com/actfes/actfes.html

Reply by Fred Marshall ● March 10, 2008

"NateS" <misc@n4te.com> wrote in message 
news:Ju6dncWvx4qA30janZ2dnUVZ_vumnZ2d@giganews.com...
> I've been programming a long time and I recently decided to write software
> to detect pitch in an audio stream in real time. I have tried a few FFT
> implementations (including FFTW), however, I don't really understand how
> to interpret the results.
>
> I have a WAV file that plays a continuous tone, middle C (~260hz). I grab
> a number of samples (eg 256) and apply a hamming window, then put it
> through the FFT. I've tried analyzing the resulting data in many ways, but
> I have yet to find that it gives me a frequency of 260hz.
>
> I know this is probably very simple. :) Would anyone care to enlighten
> me?

A general FFT would start with N complex values and result in N complex 
values.
A real time sequence would have its imaginary part zero.
There will be some sample rate or frequency.  Let's call it fs.
Accordingly there will be a time sample interval T=1/fs.
And, there will be a time span NT that's associated with the N samples and 
scaled in time by T.
So, let's say that fs=1,024Hz, so T is about 0.001 seconds.
With N=256 then the time span is NT=.25 seconds.

But why did you select NT=0.25 or N=256????  How does one do that?

Now let's look on the frequency side of the transform:

With N=256 and fs=1,024 the frequency sampling interval will be fs/N = 4Hz = 
1/NT.

Usually the scale of the frequency output of N samples goes from zero to 
fs - 1/NT (one bin short of fs).
You look at this as if it's one period of a periodic function - so the 
"next" sample at fs would be identical to the sample at f=0.
Also, the real values are symmetrical around f=0 which means also around fs.
Also, the imaginary values are antisymmetrical around f=0 which means also 
around fs.
So, the magnitude is symmetrical around f=0, fs, 2fs .....
Accordingly, you may find the first N/2 frequency samples interesting and 
the last N/2 frequency samples just redundant for your purposes.
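
The points above can be sketched in Python using the fs=1,024 and N=256 example. This is only an illustration: it uses a naive DFT from the standard library in place of a real FFT library, a synthetic 260 Hz sine in place of the WAV file, and no window, because the test tone is chosen to land exactly on a bin.

```python
import cmath
import math


def dft(samples):
    """Naive O(N^2) DFT; an FFT library returns the same values faster."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))
        for k in range(n)
    ]


fs = 1024       # sample rate in Hz
n = 256         # number of samples, so bins are fs / n = 4 Hz apart
tone_hz = 260   # synthetic test tone; 260 / 4 puts it exactly on bin 65

signal = [math.sin(2 * math.pi * tone_hz * i / fs) for i in range(n)]
magnitude = [abs(c) for c in dft(signal)]

# For real input only bins 0 .. n/2 are unique; the rest mirror them.
half = magnitude[: n // 2 + 1]
peak_bin = max(range(len(half)), key=lambda k: half[k])
peak_hz = peak_bin * fs / n

print(peak_bin, peak_hz)
print(magnitude[peak_bin], magnitude[n - peak_bin])  # mirrored pair
```

The peak lands on bin 65, which maps back to 65 * 4 Hz = 260 Hz, and bin 256 - 65 = 191 carries the same magnitude, which is the redundancy described above.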

--------

If 4Hz is good enough resolution for your application then .25 seconds of 
data was enough.  But, if 4Hz isn't good enough then you need a longer time 
span.

Then, if you want to see changes in tones that occur in 1 second then you'd 
better be dealing with a time span of maybe 0.5 seconds, eh?

Good frequency resolution competes with good time resolution and vice versa.
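
To put numbers on that trade-off, the relation 1/NT = fs/N can be turned around to ask how many samples, and how much time, a given bin spacing costs. A small sketch follows; the helper name and the sample rates in the loop are illustrative choices, not values fixed by the thread.

```python
import math


def samples_for_resolution(sample_rate_hz, resolution_hz):
    """Smallest sample count whose bin spacing fs/N is at most resolution_hz."""
    return math.ceil(sample_rate_hz / resolution_hz)


for fs in (8000, 44100):
    for df in (4.0, 1.0):
        n = samples_for_resolution(fs, df)
        print(f"fs={fs} Hz, {df} Hz bins: N={n}, span={n / fs:.3f} s")
```

The time span depends only on the bin spacing wanted (0.25 s for 4 Hz, 1 s for 1 Hz). A higher sample rate raises N but does not shorten the wait.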

If you're looking for energy peaks then you  probably need to convert the 
complex frequency values to magnitude values.

I hope something in here helps you.

Fred

Reply by Ron N. ● March 10, 2008

On Mar 10, 6:13 am, "NateS" <m...@n4te.com> wrote:
> I've been programming a long time and I recently decided to write software
> to detect pitch in an audio stream in real time. I have tried a few FFT
> implementations (including FFTW), however, I don't really understand how
> to interpret the results.
>
> I have a WAV file that plays a continuous tone, middle C (~260hz). I grab
> a number of samples (eg 256) and apply a hamming window, then put it
> through the FFT. I've tried analyzing the resulting data in many ways, but
> I have yet to find that it gives me a frequency of 260hz.

If you are sampling at 44.1 kHz you may need a much longer
window.  To make it easy and not require the application of
interpolation methods, you need an FFT of length roughly in
the range of the reciprocal of the frequency resolution you
desire (e.g. about 1 second's worth of samples for about 1 Hz
resolution, but depending on the noise or interference levels).
You also need your pitch of interest to be well above the
first few FFT bins (which are spaced apart by the reciprocal
of the FFT length in seconds).

Also note that pitch is a perceptual phenomenon different
from spectral frequency content, so the FFT results may
disagree, depending on such factors as the harmonic content,
any nearby pitches, vibrato, the influence from prior
transients, expectations set up by earlier musical content,
and so on.

IMHO. YMMV.
--
rhn A.T nicholson d.0.t C-o-M
  http://www.nicholson.com/rhn/dsp.html

Reply by NateS ● March 10, 2008

Thanks for the replies. I posted via dsprelated.com and the post being my
first took some 5 days to actually be posted. I have been reading for much
of that time, so thankfully I'm pretty sure I grok everything you guys have
said so far.

I am now able to interpret the FFT results, what frequencies and at what
magnitude. The only part I don't really understand is: what is the scale
of the magnitude?

My current issue you guys already touched upon -- FFT bin resolution.
Speech is roughly 45hz to 600hz and the resolution needed in this range is
~2.7hz down low up to ~30hz in the 500-600hz range. The sample rate is 8000
to 44100 since users can configure their microphone in various ways. I
could require a specific sample rate I suppose.

It seems that the number of samples needed by my FFT is too long to be
reasonable. I want users to see their pitch in real time, so the delay
must be low. Maybe 200 milliseconds or less? At this point I am thinking
autocorrelation may be a better solution, though I admit to not having
looked into that very deeply yet.

I do understand that the fundamental frequency can be logically deduced
but that the pitch a human hears may be quite different due to other
frequencies being present. At this point I am only worried about
determining the fundamental frequency and I'll worry about mapping it to
pitch later.

Reply by Ron N. ● March 10, 2008

On Mar 10, 11:46 am, "NateS" <m...@n4te.com> wrote:
> Thanks for the replies. I posted via dsprelated.com and the post being my
> first took some 5 days to actually be posted. I have been reading for much
> of that time, so thankfully I'm pretty sure I grok everything you guys have
> said so far.
>
> I am now able to interpret the FFT results, what frequencies and at what
> magnitude. The only part I don't really understand is: what is the scale
> of the magnitude?

Depends on your data scaling and the particulars of your
FFT implementation.  There seem to be at least 3 common
methods in how FFT results are scaled.
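
To make the scaling question concrete, here is a hedged sketch of three conventions that are commonly seen: no scaling on the forward transform (which is what FFTW documents for its transforms), division by N, and division by sqrt(N). The helper below is a single-bin DFT written for this example, not any library's API, and the test tone is placed exactly on a bin with no window applied.

```python
import cmath
import math


def dft_bin(samples, k):
    """Unscaled DFT value at bin k (no 1/N factor applied)."""
    n = len(samples)
    return sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))


n = 256
amplitude = 0.5
k = 65  # tone placed exactly on a bin center
signal = [amplitude * math.sin(2 * math.pi * k * i / n) for i in range(n)]

raw = abs(dft_bin(signal, k))         # unscaled: amplitude * n / 2
scaled_1_over_n = raw / n             # 1/N convention: amplitude / 2
scaled_unitary = raw / math.sqrt(n)   # 1/sqrt(N) convention
recovered = 2 * raw / n               # back to the sine's peak amplitude

print(raw, scaled_1_over_n, scaled_unitary, recovered)
```

The factor of 2 appears because a real sine splits its energy between the positive and negative frequency bins. If a window is applied first, dividing by the sum of the window coefficients instead of N compensates for the window's gain.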

> My current issue you guys already touched upon -- FFT bin resolution.
> Speech is roughly 45hz to 600hz and the resolution needed in this range is
> ~2.7hz down low up to ~30hz in the 500-600hz range.

This appears roughly consistent with trying to resolve
semitone separation in the Western equal-tempered scale.
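
That consistency can be checked with a few lines of Python, assuming the standard equal-tempered semitone ratio of 2^(1/12). The list of frequencies is just a sample taken from the range quoted above.

```python
SEMITONE_RATIO = 2 ** (1 / 12)  # equal-tempered semitone


def semitone_gap_hz(freq_hz):
    """Distance in Hz from freq_hz up to the next semitone."""
    return freq_hz * (SEMITONE_RATIO - 1)


for f in (45, 260, 500, 600):
    print(f"{f} Hz: next semitone is {semitone_gap_hz(f):.1f} Hz higher")
```

The gap comes out near 2.7 Hz at 45 Hz and around 30 to 36 Hz across the 500-600 Hz range, in line with the figures quoted.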

> It seems that the number of samples needed by my FFT is too long to be
> reasonable.

The number of data samples used does not need to be equal
to the length of your FFT.

> I want users to see their pitch in real time, so the delay
> must be low. Maybe 200 milliseconds or less?

200 ms is good for a 5 Hz FFT bin spacing.  If the noise
and interference level is low enough, then interpolation
methods may provide better frequency measurement resolution.
Zero padding your data and using a longer FFT is one method
of frequency domain interpolation.  Other methods include
parabolic interpolation, as well as cross-correlation with
the transform of the window function, as has been discussed
here in other threads.
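
As one possible illustration of the parabolic method, the sketch below fits a parabola through the log magnitudes of the peak bin and its two neighbors. Several things here are assumptions of the example rather than anything stated in the thread: an 8000 Hz sample rate, a clean synthetic 261.63 Hz sine, a Hann window (the original post used a Hamming window), and a direct DFT over only the bins covering 45 to 600 Hz in place of an FFT library call.

```python
import cmath
import math


def windowed_bin_magnitudes(samples, first_bin, last_bin):
    """Hann-window the samples, then return {bin: |DFT|} for the chosen bins."""
    n = len(samples)
    windowed = [
        x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n)) for i, x in enumerate(samples)
    ]
    return {
        k: abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                   for i, x in enumerate(windowed)))
        for k in range(first_bin, last_bin + 1)
    }


def parabolic_peak_hz(mags, fs, n):
    """Refine the largest bin using a parabola through three log magnitudes.

    Assumes the largest bin is not the first or last bin in mags.
    """
    k = max(mags, key=mags.get)
    a, b, c = (math.log(mags[j]) for j in (k - 1, k, k + 1))
    offset = 0.5 * (a - c) / (a - 2 * b + c)  # vertex position, in bins
    return (k + offset) * fs / n


fs = 8000          # assumed sample rate in Hz
n = 1600           # 200 ms of data, so bins are 5 Hz apart
tone_hz = 261.63   # synthetic middle C
signal = [math.sin(2 * math.pi * tone_hz * i / fs) for i in range(n)]

mags = windowed_bin_magnitudes(signal, 9, 120)   # bins covering 45 to 600 Hz
coarse_hz = max(mags, key=mags.get) * fs / n     # nearest bin center
fine_hz = parabolic_peak_hz(mags, fs, n)         # interpolated estimate
print(f"coarse: {coarse_hz:.2f} Hz, interpolated: {fine_hz:.2f} Hz")
```

On this noiseless tone the interpolated estimate lands well inside the 5 Hz bin spacing. With noise, harmonics, or a different window the remaining bias will differ.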

> At this point I am thinking
> autocorrelation may be a better solution, though I admit to not having
> looked into that very deeply yet.

Autocorrelation methods are another potential solution,
as well as phase vocoder methods.  I have a list containing
a summary of a few frequency estimation methods on my web
page:  http://www.nicholson.com/rhn/dsp.html
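
For comparison, a bare-bones autocorrelation estimator might look like the following. It is only a sketch under stated assumptions: an 8000 Hz sample rate, a clean synthetic tone, the 45 to 600 Hz search range mentioned earlier in the thread, and no interpolation between lags. The function name and its parameters are invented for this example.

```python
import math


def autocorr_pitch_hz(samples, fs, min_hz=45.0, max_hz=600.0):
    """Estimate the fundamental from the strongest autocorrelation lag."""
    n = len(samples)
    min_lag = int(fs / max_hz)   # shortest period considered, in samples
    max_lag = int(fs / min_hz)   # longest period considered, in samples
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # The overlap shrinks as the lag grows, which slightly favors
        # shorter lags over their multiples.
        value = sum(samples[i] * samples[i + lag] for i in range(n - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return fs / best_lag


fs = 8000          # assumed sample rate in Hz
n = 1600           # 200 ms of data
tone_hz = 261.63   # synthetic middle C
signal = [math.sin(2 * math.pi * tone_hz * i / fs) for i in range(n)]

estimate = autocorr_pitch_hz(signal, fs)
print(f"{estimate:.1f} Hz")
```

Because the lag is a whole number of samples, the estimate is quantized: at 8000 Hz, neighboring lags of 30 and 31 samples correspond to about 266.7 Hz and 258.1 Hz. Interpolating around the peak lag or using a higher sample rate narrows that step.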

> I do understand that the fundamental frequency can be logically deduced
> but that the pitch a human hears may be quite different due to other
> frequencies being present. At this point I am only worried about
> determining the fundamental frequency and I'll worry about mapping it to
> pitch later.

Understanding that is key to interpreting the results of
your experiments.

IMHO. YMMV.
--
rhn A.T nicholson d.0.t C-o-M

Reply by Krishna ● March 10, 2008

On Mar 10, 7:13 pm, "NateS" <m...@n4te.com> wrote:
> I've been programming a long time and I recently decided to write software
> to detect pitch in an audio stream in real time. I have tried a few FFT
> implementations (including FFTW), however, I don't really understand how
> to interpret the results.
>
> I have a WAV file that plays a continuous tone, middle C (~260hz). I grab
> a number of samples (eg 256) and apply a hamming window, then put it
> through the FFT. I've tried analyzing the resulting data in many ways, but
> I have yet to find that it gives me a frequency of 260hz.
>
> I know this is probably very simple. :) Would anyone care to enlighten
> me?

Firstly, you need to know the sampling frequency of your capture. If
(a) sampling frequency is fs and
(b) the number of points in the FFT is N, then
the frequencies which you can see are from
[-N/2:(N/2-1)]*fs/N

For example, with fs = 20MHz and a N=64 pt FFT, we can see frequencies
from [-10 to +10) MHz with a resolution of 312.5kHz.
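
The frequency axis in that example can be generated with a short Python sketch. The helper name is made up here, N is assumed to be even, and fs is passed in MHz so the result is in MHz too.

```python
def fft_frequency_axis(fs, n):
    """Bin center frequencies for k = -n/2 .. n/2 - 1 (n even), in the units of fs."""
    return [k * fs / n for k in range(-n // 2, n // 2)]


axis_mhz = fft_frequency_axis(20.0, 64)   # fs given in MHz
print(axis_mhz[0], axis_mhz[-1], axis_mhz[1] - axis_mhz[0])
```

The axis runs from -10 MHz up to 9.6875 MHz in 0.3125 MHz steps, so +10 MHz itself is excluded. FFT libraries typically return bins in the order 0 to N-1, in which case the negative frequencies are found in the upper half of the output (k >= N/2).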

Maybe the post
http://www.dsplog.com/2007/06/17/interpreting-the-output-of-fft-operation-in-matlab/
is helpful.

Krishna,
~blogs at http://www.dsplog.com
