
DFT and resonance -- question

Started by Michael April 10, 2006
Hi,

I have what I think are two simple questions, but I haven't been able
to find anything that directly answers them, after a lot of looking. If
this is really newbie stuff, could someone please just point me to a
suitable reference?

I'm interested in audio spectral analysis that matches the human ear as
much as possible. The vast majority of it is based on the Fast Fourier
Transform, of course. But the human ear is based on hair cells that
resonate at particular frequencies, so it can be modelled by setting up
harmonic oscillators at spaced frequencies, driven by the audio signal
(while I've seen this mentioned, I haven't found any real-life software
examples).

My question is: to what extent does a discrete fourier transform, made
over successive windows (for example as produced by the free program
Praat) match or differ from a spectral analysis created by modelling
harmonic oscillators that pick up frequencies? (which from what I
understand is exactly what bandpass filters / RLC circuits do.) Do they
essentially result in the same results for practical audio purposes,
which is why almost everything is geared towards FFT instead of
bandpass filters, or do they differ in considerable ways?
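(For concreteness, here is a minimal sketch of the two approaches side by side, assuming Python with NumPy/SciPy; the sample rate, band spacing, and Q below are arbitrary illustrative choices, and this is not what Praat does internally:)

```python
import numpy as np
from scipy.signal import iirpeak, lfilter

fs = 16000                      # sample rate (assumed)
t = np.arange(0, 0.5, 1 / fs)
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 660 * t)  # test signal

# "Resonator bank": one narrow two-pole peaking filter per analysis frequency.
# Each filter rings at its centre frequency when the input excites it,
# roughly like a damped harmonic oscillator driven by the signal.
centres = np.arange(100, 2000, 25.0)          # Hz, arbitrary spacing
Q = 30.0                                      # narrowness of each resonator
bank_energy = []
for fc in centres:
    b, a = iirpeak(fc, Q, fs=fs)              # 2nd-order resonant bandpass
    y = lfilter(b, a, x)
    bank_energy.append(np.mean(y ** 2))       # average power in that band

# Windowed DFT over the same chunk, for comparison.
w = np.hanning(len(x))
spec = np.abs(np.fft.rfft(x * w)) ** 2
freqs = np.fft.rfftfreq(len(x), 1 / fs)

print("resonator peak near", centres[int(np.argmax(bank_energy))], "Hz")
print("DFT peak near", freqs[int(np.argmax(spec))], "Hz")
```

Both pick out the 440 Hz partial; the differences show up in how time and frequency resolution trade off per band.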

Second, related question: My interest is in representing audio visually
in intuitive ways, and as I've been researching FFT and
signal-processing in general I've been rather surprised to find that
spectrograms-over-time are vastly more 'muddy' than one's own audio
interpretation. I realize the brain performs plenty of neural
processing to recognize frequencies, overtones, etc. However, while
all the explanations-with-pictures I've seen so far of FFT are
accompanied by the explanation that frequency resolution increases with
the length of the sample, I have seen no representation that looks like
it distinguishes frequencies without muddiness anywhere near as well as
the human ear does, where a C and a C# are obviously and clearly
different and separated. I've never seen a voiceprint diagram where the
distance between the fundamental and the first overtone (of one octave)
leaves clear space for 11 unambiguously distinct pitches in between.
Obviously if the ear does it the computer must be able to match it. So
my question is: am I just looking in the wrong places (and so where
should I look?), or is there no audio analysis algorithm that matches
the ear in time and frequency resolution, and then is the obstacle lack
of pattern-recognition technology or something else?

And actually a third question about the ear: does anyone in this group
know if the hair cells that pick up frequencies are grouped to respond
to discrete frequencies that are spaced out along intervals (i.e. from
one group of 20 hair cells to another there's a jump of 10 cents) or is
it just a very fine, gradual change in frequency without any uniform
spacing (i.e. every hair cell is around 0.1 cents higher than the last, and
cell stimuli are averaged without belonging to discrete groups)? It's
the concept I'm wondering about, I just made up those numbers to
illustrate.

Again, sorry if this is really newbie stuff but I just haven't been
able to find answers to basic applied questions like these.

Thanks,
Michael Baldwin

Michael wrote:
> to what extent does a discrete fourier transform, made
> over successive windows (for example as produced by the free program
> Praat) match or differ from a spectral analysis created by modelling
> harmonic oscillators that pick up frequencies? (which from what I
> understand is exactly what bandpass filters / RLC circuits do.) Do they
> essentially result in the same results for practical audio purposes,
> which is why almost everything is geared towards FFT instead of
> bandpass filters, or do they differ in considerable ways?
they differ. but the BPF bank is not the ideal model of human hearing either.
> Second, related question: My interest is in representing audio visually
> in intuitive ways, and as I've been researching FFT and
> signal-processing in general I've been rather surprised to find that
> spectrograms-over-time are vastly more 'muddy' than one's own audio
> interpretation. I realize the brain performs plenty of neural
> processing to recognize frequencies, overtones, etc. However, while
> all the explanations-with-pictures I've seen so far of FFT are
> accompanied by the explanation that frequency resolution increases with
> the length of the sample, I have seen no representation that looks like
> it distinguishes frequencies without muddiness anywhere near as well as
> the human ear does, where a C and a C# are obviously and clearly
> different and separated.
human discrimination of pitch is better than that. we can discriminate frequencies that are about 1/3% apart, which is about 6 cents, or 6/100 of a semitone. the 12-note-per-octave musical scale has different roots: the mathematics surrounding the perception of harmony and dissonance, and the ergonomic cost of how many keys to put on a particular instrument.
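(As a quick check of that figure, assuming equal-tempered cents:)

```python
# 6 cents = 6/100 of a semitone; one semitone is a frequency factor of 2**(1/12).
ratio = 2 ** (6 / 1200)
print((ratio - 1) * 100)   # ~0.35 %, i.e. roughly the "1/3 %" quoted above
```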
> I've never seen a voiceprint diagram where the
> distance between the fundamental and the first overtone (of one octave)
> leaves clear space for 11 unambiguously distinct pitches in between.
there is no quantization done by our hearing to 12 notes per octave. instruments from cultures other than the west, particularly old and traditional instruments, are not 12 notes per octave.
> Obviously if the ear does it the computer must be able to match it. So
> my question is: am I just looking in the wrong places (and so where
> should I look?), or is there no audio analysis algorithm that matches
> the ear in time and frequency resolution, and then is the obstacle lack
> of pattern-recognition technology or something else?
there are piles of JASA and JAES articles about the psychoacoustics of hearing loudness, pitch, and timbre of sounds. maybe you might want to begin looking with the keywords "Bark scale". i dunno.
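(For reference, a commonly cited closed-form approximation of the Bark critical-band scale is the Zwicker & Terhardt formula; a minimal sketch, with the constants taken from that published approximation rather than from this thread:)

```python
import numpy as np

def hz_to_bark(f_hz):
    """Zwicker & Terhardt approximation of the Bark critical-band scale."""
    f = np.asarray(f_hz, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

# The ~24 critical bands cover roughly 20 Hz to 15.5 kHz.
print(hz_to_bark([100, 440, 1000, 4000, 15500]))
```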
> And actually a third question about the ear: does anyone in this group
> know if the hair cells that pick up frequencies are grouped to respond
> to discrete frequencies that are spaced out along intervals (i.e. from
> one group of 20 hair cells to another there's a jump of 10 cents) or is
> it just a very fine, gradual change in frequency without any uniform
> spacing (i.e. every hair cell is around 0.1 cents higher than the last, and
> cell stimuli are averaged without belonging to discrete groups)?
it's more continuous in spacing and it isn't 0.1 cent. i am not sure that it is spaced equally in log frequency anyway. the bark scale is not the same as the log frequency scale.
> It's the concept I'm wondering about, I just made up those numbers to
> illustrate.
>
> Again, sorry if this is really newbie stuff but I just haven't been
> able to find answers to basic applied questions like these.
as your questions get more specific, i am sure you'll find what you need with Google.

r b-j
You might find the following links of interest:

http://www.phon.ucl.ac.uk/resource/cochsim/

ftp://ftp.phon.ucl.ac.uk/pub/polyfit/

John

Michael wrote:

> Second, related question: My interest is in representing audio visually
> in intuitive ways, and as I've been researching FFT and
> signal-processing in general I've been rather surprised to find that
> spectrograms-over-time are vastly more 'muddy' than one's own audio
> interpretation. I realize the brain performs plenty of neural
> processing to recognize frequencies, overtones, etc. However, while
> all the explanations-with-pictures I've seen so far of FFT are
> accompanied by the explanation that frequency resolution increases with
> the length of the sample, I have seen no representation that looks like
> it distinguishes frequencies without muddiness anywhere near as well as
> the human ear does,
I have used sounds of birds to demonstrate both the spectrogram and the differences between DSP and the human auditory system. For instance, the scream of a loon is very much a prolonged chirp pulse, while the "cuckoo" of a cuckoo is easy to interpret from the spectrogram with a little practice. Then I show a spectrogram of a collection of birds recorded by the beach. While the human ear easily separates the eiders from the gulls, ducks and oystercatchers, the spectrogram is a mess.
> where a C and a C# are obviously and clearly
> different and separated. I've never seen a voiceprint diagram where the
> distance between the fundamental and the first overtone (of one octave)
> leaves clear space for 11 unambiguously distinct pitches in between.
It has to do with the duration of the speech segment versus the required accuracy. I think the individual speech segments are on the order of 10-20 ms, which ought to represent a frequency separation on the order of 50-100 Hz.
> Obviously if the ear does it the computer must be able to match it.
Opinions might be divided on that particular matter...
> So
> my question is: am I just looking in the wrong places (and so where
> should I look?), or is there no audio analysis algorithm that matches
> the ear in time and frequency resolution, and then is the obstacle lack
> of pattern-recognition technology or something else?
Psychoacoustics (or psycho acoustics) is an interesting area. The sound of speech is not a pure sinusoid, which is what the DFT is designed to pick up, but consists of a number of pulses and resonances. There was the infamous woodpecker thread a few years ago, where the relative importance of such cues to the perceived sound was intensively debated. [BTW: Jerry, do you still have the problem with the aluminum on your chimney?]
> And actually a third question about the ear: does anyone in this group
> know if the hair cells that pick up frequencies are grouped to respond
> to discrete frequencies that are spaced out along intervals (i.e. from
> one group of 20 hair cells to another there's a jump of 10 cents) or is
> it just a very fine, gradual change in frequency without any uniform
> spacing (i.e. every hair cell is around 0.1 cents higher than the last, and
> cell stimuli are averaged without belonging to discrete groups)? It's
> the concept I'm wondering about, I just made up those numbers to
> illustrate.
>
> Again, sorry if this is really newbie stuff but I just haven't been
> able to find answers to basic applied questions like these.
You are touching on a more serious matter than I think you realize. I have several times met the argument that "this ought to be easy" based on the fact that the human ear can pick up some signature in a recorded sound. For instance, you and I can hear the difference between a chain saw, a tractor engine and an aircraft, right? So what, then, is the problem with deploying an automatic sound monitoring system that is supposed to analyze the added noise from the aircraft traffic around the new airport? It turns out that it is very difficult to design a DSP system that triggers only on aircraft noise, among several man-made noise sources. The human ear uses some sort of pattern-matching technique that will be very hard to match by means of DSP.

Rune
Hi Michael,

> So
> my question is: am I just looking in the wrong places (and so where
> should I look?), or is there no audio analysis algorithm that matches
> the ear in time and frequency resolution, and then is the obstacle lack
> of pattern-recognition technology or something else?
Rune gave great feedback on this matter. Another very interesting approach for source separation is independent component analysis (ICA). It's a statistical approach that assumes that the sources of the sounds are independent of each other and have non-Gaussian distributions. There is a really nice example using this approach that separates three pieces of music that were linearly mixed together into a single track.
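(A minimal sketch of that idea, assuming scikit-learn's FastICA, with synthetic sources standing in for the mixed music tracks; the mixing matrix and signals below are made up for illustration:)

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 8000)

# Three made-up "music" sources (non-Gaussian, mutually independent).
s1 = np.sin(2 * np.pi * 440 * t)
s2 = np.sign(np.sin(2 * np.pi * 3 * t))          # square wave
s3 = rng.laplace(size=t.size)                    # noisy, heavy-tailed
S = np.c_[s1, s2, s3]

A = rng.uniform(0.5, 1.5, size=(3, 3))           # unknown linear mixing
X = S @ A.T                                      # observed mixed tracks

ica = FastICA(n_components=3, random_state=0)
S_est = ica.fit_transform(X)                     # recovered sources (up to scale/order)
print(S_est.shape)                               # (8000, 3)
```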
> And actually a third question about the ear: does anyone in this group
> know if the hair cells that pick up frequencies are grouped to respond
> to discrete frequencies that are spaced out along intervals (i.e. from
> one group of 20 hair cells to another there's a jump of 10 cents) or is
> it just a very fine, gradual change in frequency without any uniform
> spacing (i.e. every hair cell is around 0.1 cents higher than the last, and
> cell stimuli are averaged without belonging to discrete groups)? It's
> the concept I'm wondering about, I just made up those numbers to
> illustrate.
I would suggest you take a look at the Mel scale, which is the perceptual scale of pitch.
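(The commonly used closed-form version of the Mel mapping, for reference; this is the standard 2595*log10(1 + f/700) form, not something specific to this thread:)

```python
import numpy as np

def hz_to_mel(f_hz):
    """Common Mel-scale mapping (roughly linear below 1 kHz, logarithmic above)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

print(hz_to_mel([440, 880, 1760]))   # equal octave steps are not equal Mel steps
```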
Thanks everybody for the feedback, it's fantastic and gave me all the
terms I needed to google! I now have two much more specific questions
that should be quite easy to answer..

1 -- Robert, regarding FFT you said "but the BPF bank is not the ideal
model of human hearing
either". So, what is the ideal spectral analysis algorithm we've got
today for simulating how the ear responds to the frequency spectrum?

2 -- Now that I've read a lot more, it seems that the best approach to
producing a sharp visual spectral representation of a sound would be to
use, say, several thousand bandpass filters, each sensitive to the Mel
range (so overlapping *greatly*) but spaced to the desired pitch
resolution, say 1 cent. Then, the resulting spectrum is quite blurry
(as it is always), but a "simple" deconvolution can be applied, since
we know exactly how any sine wave will 'blur out' in a (somewhat
gaussian?) style in the hundreds of filters around its original
frequency. Obviously this doesn't take into account timbre
analysis/overtones/etc, but it should result in the kind of clear,
precise pitch analysis I was interested in. So, is there
software/algorithms out there that do this now, that somebody could
point me to? Or is something wrong with my understanding? (And I don't
even want to think about what kind of computing power would be
necessary...)
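(A rough sketch of that kind of finely spaced, heavily overlapping filter bank, assuming Python/NumPy; each "filter" here is a Gaussian-windowed complex demodulator, and the window length, spacing, and test tone are arbitrary illustrative choices:)

```python
import numpy as np

fs = 16000
t = np.arange(0, 0.2, 1 / fs)
x = np.sin(2 * np.pi * 440 * t)                  # test tone

# Analysis frequencies spaced 1 cent apart around A4 (440 Hz).
cents = np.arange(-200, 201)                     # +/- 2 semitones
freqs = 440.0 * 2.0 ** (cents / 1200.0)

# Each band: multiply by exp(-j*2*pi*f*t), apply a Gaussian window, and sum.
# A wide window gives a narrow band, and neighbouring bands overlap heavily,
# exactly as described above.
win = np.exp(-0.5 * ((t - t.mean()) / 0.03) ** 2)     # ~30 ms Gaussian window
demod = np.exp(-2j * np.pi * np.outer(freqs, t))      # one row per band
band_amp = np.abs(demod @ (x * win)) / win.sum()

print("strongest band:", freqs[int(np.argmax(band_amp))], "Hz")
```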

My interest isn't so much in computer recognition of the sounds (i.e.
distinguishing between a chain saw and a tractor engine) but in
producing a visual representation where somebody familiar with the
representation could do it, and with VERY fine pitch distinction.
Although my interest is musical, not chainsaws... (though with some of
the modern pieces put together in my music composition class a few
years ago, maybe a chainsaw isn't so out of the question after all...
:P )

All you guys have been really really helpful, thank you so much.

Michael wrote:
> Thanks everybody for the feedback, it's fantastic and gave me all the
> terms I needed to google! I now have two much more specific questions
> that should be quite easy to answer..
Eh...
> 1 -- Robert, regarding FFT you said "but the BPF bank is not the ideal
> model of human hearing
> either". So, what is the ideal spectral analysis algorithm we've got
> today for simulating how the ear responds to the frequency spectrum?
>
> 2 -- Now that I've read a lot more, it seems that the best approach to
> producing a sharp visual spectral representation of a sound would be to
> use, say, several thousand bandpass filters, each sensitive to the Mel
> range (so overlapping *greatly*) but spaced to the desired pitch
> resolution, say 1 cent.
You might find the Heisenberg resolution limit interesting. It basically says that one needs a signal of a certain duration in order to obtain a desired frequency resolution. The rule of thumb is df = 1/T. In other words, if you want a frequency resolution, df, of 0.1 Hz, you need a duration T = 1/(0.1 Hz) = 10 seconds of recorded sound. And that means the tone needs to be stable during those ten seconds.
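(The same rule of thumb in code form, as a quick sanity check:)

```python
# Rule of thumb: frequency resolution df ~ 1/T for an analysis window of length T.
for T in (0.01, 0.1, 1.0, 10.0):        # seconds
    print(f"T = {T:5.2f} s  ->  df ~ {1.0 / T:6.2f} Hz")
```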
> Then, the resulting spectrum is quite blurry
> (as it is always), but a "simple" deconvolution can be applied, since
> we know exactly how any sine wave will 'blur out' in a (somewhat
> gaussian?)
Nope, more like sin(x)/x (a sinc shape), not a Gaussian.
> style in the hundreds of filters around its original
> frequency. Obviously this doesn't take into account timbre
> analysis/overtones/etc, but it should result in the kind of clear,
> precise pitch analysis I was interested in.
Something like what you sketch ought to work for *one* stable sinusoid. The problem is, as you indicate, the presence of overtones and other features.
> So, is there
> software/algorithms out there that do this now, that somebody could
> point me to? Or is something wrong with my understanding? (And I don't
> even want to think about what kind of computing power would be
> necessary...)
With the current PCs, spectrograms are no big deal unless you need real-time capacity.
> My interest isn't so much in computer recognition of the sounds (i.e.
> distinguishing between a chain saw and a tractor engine) but in
> producing a visual representation where somebody familiar with the
> representation could do it, and with VERY fine pitch distinction.
These sorts of things have been done in the past, but for long-duration stable sinusoids. The key issue is the properties of the sounds you use as input.
> Although my interest is musical, not chainsaws... (though with some of
> the modern pieces put together in my music composition class a few
> years ago, maybe a chainsaw isn't so out of the question after all...
> :P )
>
> All you guys have been really really helpful, thank you so much.
Rune
Thanks, Rune. OK, I'm looking for human-ear-type resolution, so if in
1/10th of a sec I can get resolution of 10 Hz around A (440 Hz), that's
perfect, since A#/Bb isn't until about 466 Hz.

Maybe I'm naive, but I'm not so worried about the overtones, because
they should appear quite clearly and distinctly, no? (As long as we're
talking just a few simultaneous notes such as in a piano sonata...I'm
not expecting a crystal-clear visual rendition of pitches in heavy
metal!!!) And other noise (cymbals, drums, hiss, etc.) should appear as
a continuous range of frequencies. I just want to be able to pick out
individual instrument/voice tones and overtones, and be able to clearly
see what pitch of the 12-tone scale they are. They can be along a
background of noise, that's fine, just as long as the pitches of a few
instruments stand out visually in the final spectrogram, with
*identifiable* pitches--A, Bb, C, C#, etc...

If the algorithm I described, of a filterbank with freq separation of,
say, 10 Hz between bands (although like I said, the width of each band
will overlap with the next, right?) picks up a 'blurred' image of the
spectrum, then deconvolution should work to 'tighten up' the
frequencies, just as one turns a blurry astronomical image into a clear
one. This is based on the assumption that everything is generated by
sines, so it will clean up the sines (piano fundamental, clarinet,
etc), but cymbal crashes, noise, etc. will be left largely unchanged
(which is fine!).
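(If you want to experiment, here is a minimal sketch of that deconvolution step on a toy blurred spectrum, using the standard Richardson-Lucy iteration; the kernel and line positions are made up for illustration, and real data with noise behaves much less cleanly:)

```python
import numpy as np

# Toy "spectrum": two sharp lines, blurred by a known smearing kernel
# (standing in for the filter/window response). Numbers are illustrative.
true = np.zeros(400)
true[150] = 1.0
true[162] = 0.7

kernel = np.exp(-0.5 * (np.arange(-20, 21) / 6.0) ** 2)
kernel /= kernel.sum()
blurred = np.convolve(true, kernel, mode="same") + 1e-6

# Richardson-Lucy iteration: a standard non-negative deconvolution scheme.
est = np.full_like(blurred, blurred.mean())
for _ in range(200):
    reblur = np.convolve(est, kernel, mode="same") + 1e-12
    est *= np.convolve(blurred / reblur, kernel[::-1], mode="same")

print("strongest recovered bin:", int(np.argmax(est)))   # close to bin 150
```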

So *is* there a spectrum analyzer out there that performs this
deconvolution step? Is there somewhere that explains exactly how to
apply this deconvolution? I don't want to reinvent the wheel... Or is a
10 Hz-resolution band perfectly reasonable, no deconvolution necessary,
but for some reason it doesn't work well when done via FFT?

Michael wrote:
> 1 -- Robert, regarding FFT you said "but the BPF bank is not the ideal
> model of human hearing either". So, what is the ideal spectral analysis
> algorithm we've got today for simulating how the ear responds to the
> frequency spectrum?
boy, i dunno. you would have to talk to those folks that have worked on lossy compression algorithms (like the MP3 standard). there are both frequency resolution issues (which is salient for any BPF bank model) and temporal resolution issues (loud sounds masking quiet ones that happen immediately after). perhaps, for your purposes, a bank of BPFs or similar filters with equal log frequency spacing will be best, but i do not know.

r b-j
Just an addition... I experimented around a lot more with Praat and
Beethoven's Moonlight Sonata, and have a very good feel now for
freq/time resolution. Basically, if I turn up the time resolution to
match what I intuitively feel my ear recognizes (say, 1/10th of a sec),
then the pitches turn into "caterpillars" about two notes wide each.
This makes sense, since it seems to match the whole Mel idea. If I want
what I intuitively feel the ear can sense in terms of pitch resolution,
I need to use a window of up to 1 second, and the time information
becomes far too blurry.

Now after reading how the ear 'sharpens' pitches, the deconvolution
stage seems totally necessary and logical. Each 'caterpillar' in the
first example has an obvious center. When the pitch moves up, say, a
semitone, the caterpillar and its overtones shift upwards by about a
quarter of his body. Obviously, two notes a semitone apart will not be
resolved, but that's fine. And there's a wonderful Chopin etude that
takes advantage of this to great effect.

Conclusion: deconvolution is totally necessary and desirable for my
purposes, and a filterbank seems much more applicable than an FFT.
Again, if anybody can point me to how to apply this deconvolution (or
software that demonstrates it), I will greatly appreciate it and stay
quiet for a while here! :)

If anybody knows of an algorithm that 'removes' overtones from a
spectrogram, or alternatively identifies the probability of a given
frequency being a fundamental vs overtone given the intervals between
very-present frequencies above it, figuring out the integer multiples
etc, that would *really* make my day. Of course I realize this is going
out of DSP and into AI...
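(One classical trick in that direction, not mentioned above, is the harmonic product spectrum: compress the magnitude spectrum by integer factors and multiply, so that overtones reinforce their fundamental while lone partials are suppressed. A minimal sketch on a synthetic tone, with made-up amplitudes:)

```python
import numpy as np

def harmonic_product_spectrum(x, fs, n_harmonics=4):
    """Crude fundamental estimate: multiply the spectrum by copies of itself
    compressed by factors 2..n_harmonics, so a frequency only scores highly
    when its integer multiples are also present."""
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    hps = spec.copy()
    for h in range(2, n_harmonics + 1):
        dec = spec[::h]                       # spectrum compressed by factor h
        hps[:len(dec)] *= dec
    freqs = np.fft.rfftfreq(len(x), 1 / fs)
    return freqs[np.argmax(hps[:len(spec[::n_harmonics])])]

fs = 16000
t = np.arange(0, 0.5, 1 / fs)
# A 220 Hz fundamental plus strong overtones at 440, 660, 880 Hz.
x = sum(a * np.sin(2 * np.pi * f * t)
        for a, f in [(1.0, 220), (0.8, 440), (0.6, 660), (0.4, 880)])
print(harmonic_product_spectrum(x, fs))       # close to 220, not 440
```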

Again, thanks to everyone for all the help so far.