interpreting FFT results

Started by NateS March 10, 2008
I've been programming a long time and I recently decided to write software
to detect pitch in an audio stream in real time. I have tried a few FFT
implementations (including FFTW), however, I don't really understand how
to interpret the results.

I have a WAV file that plays a continuous tone, middle C (~260hz). I grab
a number of samples (eg 256) and apply a hamming window, then put it
through the FFT. I've tried analyzing the resulting data in many ways, but
I have yet to find that it gives me a frequency of 260hz.

I know this is probably very simple. :) Would anyone care to enlighten
me?


On Mon, 10 Mar 2008 09:13:49 -0500, NateS wrote:

> I've been programming a long time and I recently decided to write
> software to detect pitch in an audio stream in real time. I have tried a
> few FFT implementations (including FFTW), however, I don't really
> understand how to interpret the results.
>
> I have a WAV file that plays a continuous tone, middle C (~260hz). I
> grab a number of samples (eg 256) and apply a hamming window, then put
> it through the FFT. I've tried analyzing the resulting data in many
> ways, but I have yet to find that it gives me a frequency of 260hz.
>
> I know this is probably very simple. :) Would anyone care to enlighten
> me?
What frequencies are you getting?

You haven't mentioned sampling rate. If you're sampling at 22000Hz, then a 256 sample FFT is going to give you frequency bins that are 86Hz wide, and those are going to be spread by your windowing.

Have you done a web search on pitch detection? I'll admit to not having done it myself, but thinking about the problem and reading comments here tells me that it isn't trivial. Hearing pitch is a human perception thing, so getting it right with a DSP requires a good knowledge of the psychoacoustics of the problem.

--
Tim Wescott
Control systems and communications consulting
http://www.wescottdesign.com

Need to learn how to apply control theory in your embedded system?
"Applied Control Theory for Embedded Systems" by Tim Wescott
Elsevier/Newnes, http://www.wescottdesign.com/actfes/actfes.html
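As a sketch of the bin-width arithmetic in the reply above: the function names here are made up for illustration, and only the 22000 Hz / 256-sample case comes from the thread.

```python
def bin_width_hz(sample_rate_hz, fft_size):
    """Spacing between adjacent FFT bins, in Hz."""
    return sample_rate_hz / fft_size


def nearest_bin(freq_hz, sample_rate_hz, fft_size):
    """Index of the FFT bin whose centre is closest to freq_hz."""
    return round(freq_hz / bin_width_hz(sample_rate_hz, fft_size))


# The case from the reply: 22000 Hz sampling, 256-point FFT.
width = bin_width_hz(22000, 256)
k = nearest_bin(260, 22000, 256)

print(f"bin width: {width:.1f} Hz")
print(f"260 Hz is nearest to bin {k}, centred at {k * width:.1f} Hz")
```

With bins this wide, a 260 Hz tone can only ever be reported as the centre of whichever bin it lands in, unless some form of interpolation is used.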
"NateS" <misc@n4te.com> wrote in message 
news:Ju6dncWvx4qA30janZ2dnUVZ_vumnZ2d@giganews.com...
> I've been programming a long time and I recently decided to write software
> to detect pitch in an audio stream in real time. I have tried a few FFT
> implementations (including FFTW), however, I don't really understand how
> to interpret the results.
>
> I have a WAV file that plays a continuous tone, middle C (~260hz). I grab
> a number of samples (eg 256) and apply a hamming window, then put it
> through the FFT. I've tried analyzing the resulting data in many ways, but
> I have yet to find that it gives me a frequency of 260hz.
>
> I know this is probably very simple. :) Would anyone care to enlighten
> me?
A general FFT starts with N complex values and results in N complex values. A real time sequence has its imaginary part set to zero.

There will be some sample rate, or sampling frequency. Let's call it fs. Accordingly there will be a time sample interval T=1/fs. And there will be a time span NT that's associated with the N samples and scaled in time by T.

So, let's say that fs=1,024Hz, so T is roughly 0.001 seconds. With N=256 the time span is NT=0.25 seconds. But why did you select NT=0.25 or N=256? How does one do that?

Now let's look at the frequency side of the transform:

With N=256 and fs=1,024Hz the frequency sampling interval will be fs/N = 4Hz = 1/NT. Usually the scale of the frequency output of N samples goes from zero to fs-1/NT. You look at this as if it's one period of a periodic function, so the "next" sample at fs would be identical to the sample at f=0.

For a real input, the real values are symmetrical around f=0, which means also around fs. The imaginary values are antisymmetrical around f=0, which means also around fs. So the magnitude is symmetrical around f=0, fs, 2fs, ...

Accordingly, you may find the first N/2 frequency samples interesting and the last N/2 frequency samples just redundant for your purposes.

--------

If 4Hz is good enough resolution for your application then 0.25 seconds of data is enough. But if 4Hz isn't good enough then you need a longer time span. Then again, if you want to see changes in tones that occur within 1 second, you'd better be dealing with a time span of maybe 0.5 seconds, eh? Good frequency resolution competes with good time resolution and vice versa.

If you're looking for energy peaks then you probably need to convert the complex frequency values to magnitude values.

I hope something in here helps you.

Fred
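The steps described above (window the frame, transform it, take magnitudes of the first N/2 bins, convert the peak bin index to Hz) can be sketched as follows. This is a minimal illustration, not anyone's actual code from the thread: the 8000 Hz sample rate is an assumption, the helper names are made up, and a naive O(N^2) DFT stands in for a real FFT library such as FFTW so that the example has no dependencies.

```python
import cmath
import math


def hamming(n_samples):
    """Hamming window coefficients."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def dft_magnitudes(x):
    """Magnitudes of the first N/2 DFT bins (naive DFT, for illustration)."""
    n_samples = len(x)
    mags = []
    for k in range(n_samples // 2):
        acc = sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
                  for n in range(n_samples))
        mags.append(abs(acc))
    return mags


fs = 8000.0      # assumed sample rate in Hz
N = 256          # samples per frame, as in the original post
f_tone = 260.0   # test tone in Hz

# Synthesize one frame of the tone and apply the window.
window = hamming(N)
frame = [window[n] * math.sin(2 * math.pi * f_tone * n / fs) for n in range(N)]

# Bin k corresponds to the frequency k * fs / N.
mags = dft_magnitudes(frame)
peak_bin = max(range(len(mags)), key=mags.__getitem__)
peak_freq = peak_bin * fs / N

print(f"bin spacing {fs / N:.2f} Hz, peak in bin {peak_bin} -> {peak_freq:.1f} Hz")
```

Under these assumptions the bins are 31.25 Hz apart, so the strongest bin is the one centred at 250 Hz rather than anything labelled 260 Hz. That is probably the kind of result the original poster was seeing.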
On Mar 10, 6:13 am, "NateS" <m...@n4te.com> wrote:
> I've been programming a long time and I recently decided to write software
> to detect pitch in an audio stream in real time. I have tried a few FFT
> implementations (including FFTW), however, I don't really understand how
> to interpret the results.
>
> I have a WAV file that plays a continuous tone, middle C (~260hz). I grab
> a number of samples (eg 256) and apply a hamming window, then put it
> through the FFT. I've tried analyzing the resulting data in many ways, but
> I have yet to find that it gives me a frequency of 260hz.
If you are sampling at 44.1 kHz you may need a much longer window. To keep it simple and avoid interpolation methods, you need an FFT whose length in time is roughly the reciprocal of the frequency resolution you desire (e.g. about 1 second's worth of samples for about 1 Hz resolution, though this depends on the noise or interference levels). You also need your pitch of interest to be well above the first few FFT bins (which are spaced apart by the reciprocal of the FFT's length in time).

Also note that pitch is a perceptual phenomenon, different from spectral frequency content, so the FFT results may disagree with what you hear, depending on such factors as the harmonic content, any nearby pitches, vibrato, the influence of prior transients, expectations set up by earlier musical content, etc.

IMHO. YMMV.
--
rhn A.T nicholson d.0.t C-o-M
http://www.nicholson.com/rhn/dsp.html
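The rule of thumb above (frame duration is roughly the reciprocal of the desired bin spacing) works out as below. The function name is hypothetical and the two sample rates are just examples; noise and windowing will push the real requirement higher.

```python
import math


def frame_for_resolution(sample_rate_hz, resolution_hz):
    """Samples and seconds needed for a given FFT bin spacing."""
    n_samples = math.ceil(sample_rate_hz / resolution_hz)
    return n_samples, n_samples / sample_rate_hz


for fs in (8000, 44100):
    n, seconds = frame_for_resolution(fs, 1.0)
    print(f"fs={fs} Hz: {n} samples ({seconds:.2f} s) for 1 Hz bins")
```

The duration is the same at any sample rate; only the number of samples changes.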
Thanks for the replies. I posted via dsprelated.com and the post being my
first took some 5 days to actually be posted. I have been reading for much
of that time, so thankfully I'm pretty sure I grok everything you guys have
said so far.

I am now able to interpret the FFT results, what frequencies and at what
magnitude. The only part I don't really understand is: what is the scale
of the magnitude?

My current issue you guys already touched upon -- FFT bin resolution.
Speech is roughly 45hz to 600hz and the resolution needed in this range is
~2.7hz down low up to ~30hz in the 500-600hz range. The sample rate is 8000
to 44100 since users can configure their microphone in various ways. I
could require a specific sample rate I suppose.

It seems that the number of samples needed by my FFT is too long to be
reasonable. I want users to see their pitch in real time, so the delay
must be low. Maybe 200 milliseconds or less? At this point I am thinking
autocorrelation may be a better solution, though I admit to not having
looked into that very deeply yet.

I do understand that the fundamental frequency can be logically deduced
but that the pitch a human hears may be quite different due to other
frequencies being present. At this point I am only worried about
determining the fundamental frequency and I'll worry about mapping it to
pitch later.
On Mar 10, 11:46 am, "NateS" <m...@n4te.com> wrote:
> Thanks for the replies. I posted via dsprelated.com and the post being my
> first took some 5 days to actually be posted. I have been reading for much
> of that time, so thankfully I'm pretty sure I grok everything you guys have
> said so far.
>
> I am now able to interpret the FFT results, what frequencies and at what
> magnitude. The only part I don't really understand is: what is the scale
> of the magnitude?
Depends on your data scaling and the particulars of your FFT implementation. There seem to be at least 3 common methods in how FFT results are scaled.
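To make the scaling question concrete, here is a sketch for one common convention: an unnormalised forward transform, which is what FFTW computes. Other libraries may divide by N or by sqrt(N), so check the documentation of whichever one you use. The helper name and test values are made up for illustration, and a single-bin naive DFT stands in for a real FFT call.

```python
import cmath
import math


def dft_bin(x, k):
    """One bin of the unnormalised DFT: sum of x[n] * exp(-2j*pi*k*n/N)."""
    n_samples = len(x)
    return sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
               for n in range(n_samples))


N = 256
k = 8              # tone placed exactly on bin 8
amplitude = 0.5    # peak amplitude of the test sine

tone = [amplitude * math.sin(2 * math.pi * k * n / N) for n in range(N)]

# No window: |X[k]| = amplitude * N / 2, so divide by N/2 to recover it.
raw = abs(dft_bin(tone, k))
est_plain = raw / (N / 2)

# Hamming window: the gain is approximately sum(window) / 2 instead of N / 2.
window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]
windowed = [w * s for w, s in zip(window, tone)]
est_windowed = abs(dft_bin(windowed, k)) / (sum(window) / 2)

print(f"raw |X[k]| = {raw:.2f}")
print(f"amplitude, no window: {est_plain:.4f}")
print(f"amplitude, Hamming:   {est_windowed:.4f}")
```

These corrections only hold for a tone sitting exactly on a bin centre. A tone between bins reads lower because its energy is shared with the neighbouring bins.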
> My current issue you guys already touched upon -- FFT bin resolution.
> Speech is roughly 45hz to 600hz and the resolution needed in this range is
> ~2.7hz down low up to ~30hz in the 500-600hz range.
This appears roughly consistent with trying to resolve semitone separation in the Western equal-tempered scale.
> It seems that the number of samples needed by my FFT is too long to be
> reasonable.
The number of data samples used does not need to be equal to the length of your FFT.
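One way to read the point above is zero padding: window a short frame of data, append zeros, and run a longer FFT. The sketch below assumes an 8000 Hz sample rate, a 200 ms frame and a clean test tone near middle C. The small recursive radix-2 FFT is only there to keep the example self-contained; a real program would use a library.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


fs = 8000.0        # assumed sample rate in Hz
n_data = 1600      # 200 ms of audio at 8 kHz
n_fft = 8192       # FFT length after zero padding (power of two)
f_tone = 261.6     # test tone near middle C, in Hz

# Window the data, then append zeros up to the FFT length.
frame = [(0.54 - 0.46 * math.cos(2 * math.pi * n / (n_data - 1)))
         * math.sin(2 * math.pi * f_tone * n / fs) for n in range(n_data)]
padded = frame + [0.0] * (n_fft - n_data)

# Keep the non-redundant half and locate the strongest bin.
mags = [abs(v) for v in fft(padded)[: n_fft // 2]]
peak_bin = max(range(len(mags)), key=mags.__getitem__)
peak_freq = peak_bin * fs / n_fft

print(f"bin spacing {fs / n_fft:.3f} Hz, peak at {peak_freq:.2f} Hz")
```

Zero padding interpolates the spectrum of the 200 ms frame onto a finer grid. It does not make two closely spaced tones any easier to separate, since that still depends on the duration of the actual data.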
> I want users to see their pitch in real time, so the delay
> must be low. Maybe 200 milliseconds or less?
200 ms is good for a 5 Hz FFT bin spacing. If the noise and interference level is low enough, then interpolation methods may provide better frequency measurement resolution. Zero padding your data and using a longer FFT is one method of frequency domain interpolation. Other methods include parabolic interpolation, as well as cross-correlation with the transform of the window function, as has been discussed here in other threads.
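A rough sketch of the parabolic interpolation mentioned above: fit a parabola through the log magnitudes of the peak bin and its two neighbours, and take the vertex as the refined peak position. The sample rate, test tone and helper names are assumptions for illustration, the search is limited to the 45-600 Hz band quoted earlier in the thread, and a naive per-bin DFT stands in for an FFT.

```python
import cmath
import math


def dft_bin_magnitude(x, k):
    """Magnitude of one DFT bin (naive; a real program would use an FFT)."""
    n_samples = len(x)
    return abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
                   for n in range(n_samples)))


def parabolic_peak(mags, k):
    """Refine peak bin k with a parabola through log-magnitudes k-1, k, k+1."""
    a, b, c = math.log(mags[k - 1]), math.log(mags[k]), math.log(mags[k + 1])
    return k + 0.5 * (a - c) / (a - 2 * b + c)


fs = 8000.0      # assumed sample rate in Hz
N = 1600         # 200 ms frame, so bins are 5 Hz apart
f_tone = 261.6   # clean test tone near middle C, in Hz

frame = [(0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)))
         * math.sin(2 * math.pi * f_tone * n / fs) for n in range(N)]

# Evaluate only the bins covering 45-600 Hz, plus one neighbour on each side.
lo, hi = int(45 * N / fs), int(600 * N / fs)
mags = {k: dft_bin_magnitude(frame, k) for k in range(lo - 1, hi + 2)}

peak_bin = max(range(lo, hi + 1), key=mags.__getitem__)
estimate = parabolic_peak(mags, peak_bin) * fs / N

print(f"raw peak {peak_bin * fs / N:.1f} Hz, interpolated {estimate:.2f} Hz")
```

On a noise-free tone this lands well inside the 5 Hz bin spacing. With real speech, noise and harmonics will limit how much the interpolation helps.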
> At this point I am thinking
> autocorrelation may be a better solution, though I admit to not having
> looked into that very deeply yet.
Autocorrelation methods are another potential solution, as well as phase vocoder methods. I have a list containing a summary of a few frequency estimation methods on my web page: http://www.nicholson.com/rhn/dsp.html
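For comparison, a bare-bones autocorrelation estimator might look like the sketch below. It is only one simple variant, not the method from the linked page: it picks the lag with the largest raw autocorrelation inside the 45-600 Hz range and converts that lag to a frequency. The function name, sample rate and test tone are assumptions.

```python
import math


def autocorr_pitch(x, sample_rate_hz, f_min_hz, f_max_hz):
    """Estimate the fundamental from the strongest autocorrelation lag."""
    lag_min = int(sample_rate_hz / f_max_hz)   # shortest period considered
    lag_max = int(sample_rate_hz / f_min_hz)   # longest period considered
    best_lag, best_value = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Raw (biased) autocorrelation at this lag.
        value = sum(x[n] * x[n + lag] for n in range(len(x) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate_hz / best_lag


fs = 8000.0      # assumed sample rate in Hz
f_tone = 261.6   # clean test tone near middle C, in Hz
frame = [math.sin(2 * math.pi * f_tone * n / fs) for n in range(1600)]

estimate = autocorr_pitch(frame, fs, 45.0, 600.0)
print(f"autocorrelation estimate: {estimate:.1f} Hz")
```

Because lags are whole samples, the estimate is quantised (around 260 Hz at 8 kHz, neighbouring lags are several Hz apart), so practical versions usually interpolate around the best lag or work at a higher sample rate.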
> I do understand that the fundamental frequency can be logically deduced
> but that the pitch a human hears may be quite different due to other
> frequencies being present. At this point I am only worried about
> determining the fundamental frequency and I'll worry about mapping it to
> pitch later.
Understanding that is key to interpreting the results of your experiments.

IMHO. YMMV.
--
rhn A.T nicholson d.0.t C-o-M
On Mar 10, 7:13 pm, "NateS" <m...@n4te.com> wrote:
> I've been programming a long time and I recently decided to write software
> to detect pitch in an audio stream in real time. I have tried a few FFT
> implementations (including FFTW), however, I don't really understand how
> to interpret the results.
>
> I have a WAV file that plays a continuous tone, middle C (~260hz). I grab
> a number of samples (eg 256) and apply a hamming window, then put it
> through the FFT. I've tried analyzing the resulting data in many ways, but
> I have yet to find that it gives me a frequency of 260hz.
>
> I know this is probably very simple. :) Would anyone care to enlighten
> me?
Firstly, you need to know the sampling frequency of your capture. If
(a) the sampling frequency is fs and
(b) the number of points in the FFT is N,
then the frequencies which you can see are
[-N/2:(N/2-1)]*fs/N

For example, with fs = 20MHz and an N=64 point FFT, we can see frequencies in the range [-10, +10) MHz with a resolution of 312.5kHz.

Maybe the post
http://www.dsplog.com/2007/06/17/interpreting-the-output-of-fft-operation-in-matlab/
is helpful.

Krishna
~blogs at http://www.dsplog.com
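The frequency axis in the reply above can be written out as a short sketch. The function name is hypothetical; the numbers are the 20 MHz / 64-point example from the post.

```python
def fft_frequency_axis(sample_rate_hz, fft_size):
    """Centre frequencies for bins ordered from -N/2 to N/2 - 1."""
    step = sample_rate_hz / fft_size
    return [k * step for k in range(-(fft_size // 2), fft_size // 2)]


freqs = fft_frequency_axis(20e6, 64)

print(f"first bin: {freqs[0] / 1e6} MHz")
print(f"last bin:  {freqs[-1] / 1e6} MHz")
print(f"spacing:   {(freqs[1] - freqs[0]) / 1e3} kHz")
```

Note that this is the centred ordering. Most FFT routines return bins 0 to N-1 instead, with the negative frequencies in the upper half of the output, so the array usually has to be rotated by N/2 before it lines up with this axis.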