DSPRelated.com
Forums

Analyzing fundamental frequencies from musical signals

Started by dies...@yahoo.ca February 24, 2006
robert bristow-johnson wrote:
> i'm sorry Ron, but there isn't really a (blind) pre-filter that will
> make this work with polyphonic input where the singers' pitches are in
> the same ballpark (within an octave or two). i suppose a non-blind
> comb filtering that knocks out tones that the pitch detector is not
> focusing on, but that either becomes a circular problem (knowing the
> pitch a priori to tune the comb filter) and still does not solve the
> problem of initially determining the pitch of the first voice from the
> cacophony of all of the other voices.
It sounds like the OP is a music teacher who is present at the time of the recordings. Therefore he probably has access to a printed score sheet, and thus can calculate the approximate pitch represented by the note that a particular part is supposed to be singing at a particular point in the score (assuming the choir isn't way out of tune of course). That will solve the a priori problem you mention.

If the other parts aren't singing octaves, why wouldn't a notch filter over the other parts, or a bandpass filter around the dominant overtone related to the pitch of interest, help reject interference to an autocorrelation?

IMHO. YMMV. -- rhn A.T nicholson d.0.t C-o-M
Bob Monsen wrote:
> So, not being an expert, perhaps I'm missing some crucial point here,
> but why can't he just use an overlapping (or maybe even non-overlapping)
> set of sequential FFTs to track the frequency changes?
It looks like it's the standard problem of resolution in frequency versus resolution in time. To get more resolution in frequency from an FFT requires a longer FFT window, which, without zero padding, results in less resolution in time, and/or might even require a buffer longer than the sound of interest, which would introduce outside time-domain interference. Autocorrelation can be done with a sample window as short as maybe 2 periods of the pitch of interest. Phase vocoding is similar to an autocorrelation of the signal after prefiltering by a form of one-bin-wide bandpass filter, which helps a bit with noise and interference rejection.

In music, the strongest frequency present might be an overtone of the musical pitch. Autocorrelation can help determine if this is the situation. But if one already knows the approximate pitch, one can measure the dominant frequency and divide that down to get more precise pitch information.

IMHO. YMMV. -- rhn A.T nicholson d.0.t C-o-M
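[Editor's note: the autocorrelation idea above can be sketched in a few lines. This is a minimal illustration on a synthetic tone, in NumPy rather than the Matlab the participants are using; the window length and lag search range are illustrative choices, not values from the thread.]

```python
import numpy as np

def autocorr_pitch(x, fs, fmin=80.0, fmax=1000.0):
    """Estimate the fundamental by peak-picking the autocorrelation
    over a plausible range of pitch periods."""
    x = x - np.mean(x)
    r = np.correlate(x, x, mode="full")[len(x) - 1:]   # lags 0..N-1
    lo, hi = int(fs / fmax), int(fs / fmin)            # lag search range
    lag = lo + int(np.argmax(r[lo:hi]))
    return fs / lag

fs = 44100
t = np.arange(int(0.05 * fs)) / fs        # a 50 ms snippet, a few periods
# a 110 Hz tone whose 3rd harmonic is stronger than its fundamental
x = 0.5 * np.sin(2 * np.pi * 110 * t) + 1.0 * np.sin(2 * np.pi * 330 * t)
print(autocorr_pitch(x, fs))              # close to 110, not 330
```

Note that the peak lands on the full pitch period even though most of the energy is in the overtone, which is Ron's point about autocorrelation resolving the overtone-versus-fundamental ambiguity.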
Bob Monsen wrote:
> On Sat, 25 Feb 2006 06:59:14 -0800, robert bristow-johnson wrote:
> ...
> > these "FFT methods" will suffer the same problems of separating
> > frequency components coming from different voices, and if you're not
> > careful, they have other problems (such as getting the pitch wrong when
> > there is a "missing fundamental" which happens for some musical tones).
>
> The OP gets to select the snippet of audio that he is interested in, so he
> can select a passage in which
>
> A) the singing is just beginning
> B) there are no other voices, other than the section he is interested in
> C) there is only a single note, which the choir section is trying to hit
>
> He wants to find out how the section converges (or not) on the note.
>
> So, not being an expert, perhaps I'm missing some crucial point here,
> but why can't he just use an overlapping (or maybe even non-overlapping)
> set of sequential FFTs to track the frequency changes?
which frequency? it's unlikely that there is a single "bump" in the spectra to track. the "fundamental frequency" that the OP wants to track is one frequency of many in a tone that is not a single sinusoid.

i'm not saying that *you're* doing this, Bob, but just for the purpose of clarity, i want to point out that IMO the FFT is appealed to a lot to solve whatever frequency or pitch problem crops up, and it's a tool or weapon that does little effectively unless it's wielded well. so you take a little snippet of audio (maybe you window it) and pass that to an FFT, you get the complex output (bit-reversed back into order if your FFT needs that), and compute the magnitude or magnitude-squared (or log of it) and look at it. fine, now what are you (or the automated process that you're writing code for) going to do with it?

if you pick the frequency of the bin of the maximum amplitude (ignoring interpolation issues for the moment), that might be useful, but it might be a harmonic, not the fundamental. especially for vocals, since the singer's mouth is a highly resonant chamber: when they sing "oooooo" and move the tongue around, the most resonant frequency is changing even when the pitch (or fundamental frequency) is not.

for a single monophonic note, a legitimate method some people use to take care of this is to sample the spectrum magnitude at equally spaced points (as possible harmonics) and add those up, and the spacing that results in the maximum amplitude or energy determines the resultant fundamental frequency. but think about that for a moment: that is like multiplying the spectrum by the frequency response of some kinda comb filter. so now think about what this comb filter would be like in the time domain (it's the input added to a delayed version of the input). if you toss in a little scaling (1/2) and flip the thing over (subtract it from 1), you have effectively something that looks a lot like the AMDF or ASDF. sometimes the simplest implementation is, well, the simplest.

r b-j
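[Editor's note: the ASDF that r b-j lands on above is only a few lines of code. Here is a minimal sketch (function name and all parameters are illustrative, not from the thread), run on a synthetic quasi-periodic tone whose strongest partial is an overtone:]

```python
import numpy as np

def asdf_pitch(x, fs, fmin=80.0, fmax=1000.0):
    """Average Squared Difference Function: pick the lag tau that
    minimizes mean((x[t] - x[t+tau])^2) over a plausible pitch range."""
    lo, hi = int(fs / fmax), int(fs / fmin)
    n = len(x) - hi                       # samples compared at every lag
    d = [np.mean((x[:n] - x[tau:tau + n]) ** 2) for tau in range(lo, hi + 1)]
    return fs / (lo + int(np.argmin(d)))

fs = 44100
t = np.arange(int(0.05 * fs)) / fs
# a 110 Hz tone whose strongest partial is the 3rd harmonic
x = (0.4 * np.sin(2 * np.pi * 110 * t)
     + 0.3 * np.sin(2 * np.pi * 220 * t)
     + 1.0 * np.sin(2 * np.pi * 330 * t))
print(asdf_pitch(x, fs))                  # picks the fundamental, ~110 Hz
```

The difference function is near zero only at the full pitch period, where *all* the harmonics cancel at once, which is exactly the harmonic-comb behavior described in the spectral-domain version.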
Ron N. wrote:
> It sounds like the OP is a music teacher who is present at the
> time of the recordings. Therefore he probably has access to
> a printed score sheet, and thus can calculate the approximate
> pitch represented by the note that a particular part is supposed
> to be singing at a particular point in the score (assuming the choir
> isn't way out of tune of course). That will solve the a priori
> problem you mention.
i would think that the comb filter would need to be precisely tuned (A = 440 Hz is not good enough when the person is actually singing 443 Hz). so a real pitch tracker (that might be informed by the note they're *supposed* to be singing, or perhaps not) that results in a pretty precise value (precision is not necessarily accuracy, but we would like to hope that it's also accurate), to define the precise delay (using fractional-sample interpolation) of the comb filter, might be needed to really knock out that tone. so we need to know the precise frequency of the tone so we can tune a comb filter to extract it so we can accurately measure the frequency of it.

now if they're separately miked and recorded on separate tracks, each pitch detector applied to each track can be assured that it's looking at a single tone or voice with no need to separate them. the *only* frequency components (of significant energy) that it's looking at are the harmonics of a *single* fundamental. it is a quasi-periodic function and pitch trackers can work pretty well on that kind of input.
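[Editor's note: to make the tuning-precision point concrete, here is a rough sketch of a feedforward comb with a fractional delay, on two synthetic "voices". Plain linear interpolation stands in for a proper fractional-delay filter, and all frequencies and amplitudes are illustrative:]

```python
import numpy as np

fs = 44100
f0 = 443.0                                # the pitch actually sung, not 440
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * f0 * t) + 0.7 * np.sin(2 * np.pi * 331.0 * t)

# y[n] = (x[n] - x[n - D]) / 2 with D = fs/f0 nulls f0 and all of its
# harmonics.  D is fractional, so approximate x[n - D] by linear
# interpolation between x[n - k] and x[n - k - 1].
D = fs / f0
k, frac = int(D), D - int(D)
xi = np.concatenate([np.zeros(k + 1), x])
delayed = (1 - frac) * xi[1:1 + len(x)] + frac * xi[:len(x)]
y = 0.5 * (x - delayed)

# measure how much of the f0 "voice" survives (skipping filter start-up)
seg = slice(k + 1, len(x))
c, s = np.cos(2 * np.pi * f0 * t[seg]), np.sin(2 * np.pi * f0 * t[seg])
amp_in = 2 * np.hypot(np.mean(x[seg] * c), np.mean(x[seg] * s))
amp_out = 2 * np.hypot(np.mean(y[seg] * c), np.mean(y[seg] * s))
print(amp_in, amp_out)        # the 443 Hz voice is strongly attenuated
```

Retuning `f0` to 440 instead of the true 443 makes the null miss the tone, which is the point about needing the precise frequency before the comb can be tuned.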
> If the other parts aren't singing octaves, why wouldn't a notch filter
> over the other parts,
you have to know what those other parts are, a priori, to tune the notch filter(s).
> or a bandpass filter around the dominant
> overtone related to the pitch of interest, help reject interference
> to an autocorrelation?
well, pitch shifting using the phase vocoder is pretty blind to the actual pitches in the input (which is one reason it works for complete mixes), but time-domain pitch shifting of multiple voices, polyphonic pitch extraction, and source or instrument separation are still rather unsolved problems in the state of the art of audio DSP at the moment.

i remember a while ago Rob Maher did a paper on separating duet recordings. that algorithm only had to worry about *two* separate tones (and the interlacing and interaction of two sets of harmonics) and even so, the problem was a female canine (it didn't always work). with more voices (than 2), i imagine it would be a hideous monster.

r b-j
Hello!

I'm really sorry this reply comes in so late, I did not realise this
newsgroup would move so fast. Well, better late than never.

Ron N. wrote:
> Bob Monsen wrote:
> > So, not being an expert, perhaps I'm missing some crucial point here,
> > but why can't he just use an overlapping (or maybe even non-overlapping)
> > set of sequential FFTs to track the frequency changes?
>
> It looks like it's the standard problem of resolution in frequency
> versus resolution in time. To get more resolution in frequency from
> an FFT requires a longer FFT window, which, without zero padding,
> results in less resolution in time, and/or might even require a
> buffer longer than the sound of interest which would introduce
> outside time-domain interference.
Yes, that's exactly what I meant. If I want more accuracy in frequency, I need a longer window, and resolution in time decreases.

A few points of this project that I could not make clear enough:

1) Yes, I have the whole choir on one track. Or actually two, but it being in stereo does not help at all. So this is a real challenge. I have already recorded what I'm about to record (over a year ago), so now I'm just wondering how to get the best results out of it. I did ponder quite a bit about whether to use a multi-track recording system, but then I decided against it. As Richard Dobson wonderfully said, "a capella choral singing is an extra-ordinary socio-acoustic phenomenon". So I thought that if I took the choir into a studio and tried to record voices separately, I would lose the basic essence of the phenomenon I'm trying to study. I mean the situation would be so far from a real choir-singing event that it would not be worth very much to me. I also thought about things being easier if I had 4 or 8 singers instead of 35, but that kind of "barbershop" music is not what I'm interested in. I'm fascinated by the way 35 singers can work like a single instrument without anyone [inside] ever thinking about how 35 throats, 70 ears and 35 brains can possibly do anything together.

2) Our FFT method so far consists of taking a series of 0.2 sec slices (8820 samples) at 0.1 sec intervals (50% overlap?), running them through a Hanning window, zero-padding with a million zeros and plotting the result with frequency and amplitude. We have also made 3D plottings of all the slices (time, freq, amp) and it works nicely, but now I'm more interested in getting the actual data out of there.

3) The autocorrelation thingy we tried didn't work because it seemed to be highly vulnerable to noise.
We knew that it was only supposed to work if I had voices on separate tracks, but we thought we might check if we could use very heavy filtering and isolate the fundamentals that we are trying to measure (measure, not "find", since I know what's being sung there, so I know with 5% accuracy where those fundamentals are).

So before trying any filtering we built an autocorrelation system and tested it with a 440 Hz Matlab-made sine wave. It was correct and accurate. Then we thought to test an extremely noisy scenario: we made a waveform that was the sum of 400 Hz and 500 Hz with equal amplitudes and fed it to the beast. As a result we got a very accurate, nice and clean 441.xxx Hz. I'm not sure of the actual result, but something like that. What scared us off was the fact that the result was as clean as it would have been if there had been a single noiseless wave of that frequency. All the noise was just gone. Nothing suggested that anything was wrong with the result. I know our noise example was an extreme one, but it revealed that power in frequencies other than the one being measured will pull the result in the direction of the noise, and that's the last thing we might want. Is this how autocorrelation is supposed to work or did we do something wrong?

I have some real noise issues there also, because it was not a studio recording, not least of which is a DC hum in the left channel, but I think I'll handle that one by filtering out everything below 55 Hz (there's nothing I'm interested in down there).
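[Editor's note: the failure mode Erkki describes is easy to reproduce. A stand-alone sketch (signal length, search range, and exact result are illustrative; the precise frequency reported depends on the search range and windowing, which may explain the 441.xxx figure above):]

```python
import numpy as np

fs = 44100
t = np.arange(int(0.2 * fs)) / fs           # 0.2 s test signal
# two equal-amplitude tones -- the "extremely noisy" test from the post
x = np.sin(2 * np.pi * 400 * t) + np.sin(2 * np.pi * 500 * t)

# peak-pick the autocorrelation near the expected pitch range
r = np.correlate(x, x, mode="full")[len(x) - 1:]
lo, hi = int(fs / 550), int(fs / 350)
lag = lo + int(np.argmax(r[lo:hi]))
f_est = fs / lag
print(f_est)   # one clean, confident number between 400 and 500 -- neither tone
```

Nothing in the output hints that the input was two tones: the autocorrelation peak simply sits at a compromise lag between the two periods, exactly the silent failure described above.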
> In music, the strongest frequency present might be an overtone of
> the musical pitch. Autocorrelation can help determine if this is
> the situation. But if one already knows the approximate pitch, one
> can measure the dominant frequency and divide that down to get more
> precise pitch information.
I wonder how often that situation occurs. I'm not sure if you mean the situation where a person singing alone (or many people singing the same note) produces a sound where an overtone is dominant, or a situation where that comes from different voices' overtones boosting each other, for example two people singing a fifth apart at 200 and 300 Hz with a common overtone of 600 Hz being dominant.

The situation I'm worried about is when an upper voice is singing a pitch that is simultaneously an overtone of a lower voice. Let's say basses sing 110 Hz (A) and altos sing 330 Hz (the E an octave and a fifth higher). And let's say tuning is not perfect. Now there would be a peak at 330 Hz that's the sum of altos and basses. How much would the basses' overtone affect the peak at 330 Hz? Let's say that basses sing 108 Hz, making their 3rd partial 324 Hz. If I tried to study it with a non-zeropadded FFT, I would get such a wide "hill" from the altos' fundamental (330 Hz) that there would be quite a lot of power shown also at 324 Hz. Now if we add the power from the basses to that, I believe 324 Hz would become the peak instead. I'm hoping to tackle that by zeropadding (making the peaks steeper), but I'm not sure if it works. And it would be nasty towards the singers to assume that "if the upper pitch is in perfect harmony, it must be an overtone" :>

Ron Nicholson wrote:
> As you found, zero-padding and using a long fft, although a very
> accurate method of interpolating frequency, will not show fine
> detail in the frequency envelope. However frequency is the
> derivative of phase. So what I might try is a technique from
> phase vocoding. Use overlapped successive short fft's and
> compare the phase changes in the nearest bin of interest with
> what would be the phase change represented by the overlap
> offset. Plot that phase difference. The slope of the plot will
> represent the frequency offset from the fft bin center, and any
> curvature in the plot will represent a change in frequency.
>
> This could work with fft windows as short as maybe a dozen
> cycles or less of the dominant frequency (which itself may be
> an overtone of the fundamental pitch), so you can get much
> better time resolution. For 330 Hz, maybe try 75% overlapped
> windows as short as maybe 1024 samples of 44.1 KHz.
Wow.. would this also work in my case where all the voices are on the same track, or for individual voices only? I have not yet figured out what you mean by this but I will, and anyway understanding the answer is the asker's responsibility. One key question: What do you mean by "nearest bin of interest"? (or rather, what does a "bin" mean in this context?) A stupid question, perhaps, but I only know as many things about signal processing as I have come across in this project, and also it was my friend who wrote the actual code, although I know what the code does. I can read it but not yet write :)

Richard Dobson wrote:
> Excellent though Matlab may be for audio analysis, you may find that the
> advanced tools designed specifically for analysing musical audio may suit your
> purposes better. For example, the most comprehensive system around at the moment
> seems to be the CLAM suite of tools (all GPL with sources, but full binary
> installer-based packages are now available) from University Pompeu Fabra:
Thanks a lot!! I will check that out thoroughly. And thank you for the insights on more musical matters! I didn't realise that singers approach the tones from below, although I have practical experience in the matter. I mean, when you mentioned it I realised that's just what I do.

I have a pianist background so I tend to consider harmony a "vertical" thing. When I sing in a choir, I don't think very much about successive intervals being pure; instead I compare my voice with other simultaneously occurring voices. I surely don't assume that equal tuning would be the basis for choirs; instead I'd like to really find out what is the amateur reality of tuning. By researching the practice I'm trying to find out what is the hidden ideal of tuning that they are trying to achieve, if there is one.

There is one paradox that I'm particularly interested in, and this goes way off the topic of DSP. Let's say a choir sings in C major the following typical cadence with very slow chords: C, F, dm7, G7, C, or in other words I - IV - II7 - V7 - I. The F major is tuned so that F and C make a perfect 5th and the A is a pure major 3rd above F (a 4:5:6 major chord). In dm7 the D is introduced - it will go to a perfect 5th below the A while the other notes remain untouched. Now we go to G7, where the singers of the D note will hold their pitch from the previous chord. G is tuned by the D, and in doing so it will become lower than it was in the beginning, by a syntonic comma (80:81). I know there are solutions to this, but what I think is weird about this phenomenon is that everybody is singing in perfect harmony and our tonic is falling. I'd like to know how this thing is dealt with "in real life".

I must sadly confess (to Ron N. and Robert Bristow-Johnson) that I actually didn't understand (yet) much of your later conversation, and therefore I fail at trying to comment on it, although I would very much like to.
Still one more question: It's usually recommended to zero-pad by something like the sample's own length of zeros. What effect does outrageous zeropadding (a million zeros after an 8820-sample snippet) have on the FFT's reliability? Would it be an issue that would ruin our results? It certainly makes results look more precise, but would that just mean adding decimals to a guess or would it really be more accurate?

Thanks very much to all of you who replied!

Erkki Nurmi
dieselviulu@yahoo.ca wrote:
> Ron Nicholson wrote:
> > In music, the strongest frequency present might be an overtone of
> > the musical pitch. Autocorrelation can help determine if this is
> > the situation. But if one already knows the approximate pitch, one
> > can measure the dominant frequency and divide that down to get more
> > precise pitch information.
>
> I wonder how often that situation occurs. I'm not sure if you mean the
> situation where a person singing alone (or many people singing the same
> note) produces a sound where an overtone is dominant,
I think a male bass voice can easily produce more overtone than fundamental pitch energy. (Or maybe it's just the cheap microphones I've been using which roll off everything in the low register.) I usually measure the strongest overtone because that's where the S/N ratio is better, then divide down to get the pitch (which might be inferred by ear, or perhaps by rough autocorrelation or cepstral methods).
> Ron Nicholson wrote:
> > As you found, zero-padding and using a long fft, although a very
> > accurate method of interpolating frequency, will not show fine
> > detail in the frequency envelope. However frequency is the
> > derivative of phase. So what I might try is a technique from
> > phase vocoding. Use overlapped successive short fft's and
> > compare the phase changes in the nearest bin of interest with
> > what would be the phase change represented by the overlap
> > offset. Plot that phase difference. The slope of the plot will
> > represent the frequency offset from the fft bin center, and any
> > curvature in the plot will represent a change in frequency.
> >
> > This could work with fft windows as short as maybe a dozen
> > cycles or less of the dominant frequency (which itself may be
> > an overtone of the fundamental pitch), so you can get much
> > better time resolution. For 330 Hz, maybe try 75% overlapped
> > windows as short as maybe 1024 samples of 44.1 KHz.
>
> Wow.. would this also work in my case where all the voices are on the
> same track, or for individual voices only? I have not yet figured out
> what you mean by this but I will, and anyway understanding the answer
> is the asker's responsibility. One key question: What do you mean by
> "nearest bin of interest"? (or rather, what does a "bin" mean in this
> context?)
I'm not sure if any analysis can track an individual voice inside a large choral mix.

An FFT produces output for discrete frequencies (multiples of the sample rate divided by the FFT length). Not sure if it's standard usage, but I call a bin either one of those discrete frequencies or a window of frequencies with width equal to the distance between the discrete frequencies. Most of the energy of a particular frequency will show up in the nearest discrete frequency bin, even if it is not exactly equal to one of the bin center frequencies. So you'd choose that bin for phase analysis across FFT frames.

Given a 50% overlap, any exact even-numbered bin frequency should have the same phase in successive frames, any odd-numbered one the phase inverted, and any deviation from this likely represents a frequency not exactly at the bin center, or a changing pitch (vibrato or glissando, etc.).
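[Editor's note: here is a minimal sketch of that phase-difference trick on a synthetic tone, using the 1024-sample frames and 75% overlap suggested earlier in the thread; the test frequency and variable names are illustrative:]

```python
import numpy as np

fs = 44100
f_true = 334.7                       # deliberately between bin centers
N, hop = 1024, 256                   # 1024-sample frames, 75% overlap
t = np.arange(N + hop) / fs
x = np.sin(2 * np.pi * f_true * t)

win = np.hanning(N)
X0 = np.fft.rfft(win * x[:N])        # two successive overlapped frames
X1 = np.fft.rfft(win * x[hop:hop + N])

k = int(round(f_true * N / fs))      # nearest bin of interest
dphi = np.angle(X1[k]) - np.angle(X0[k])
expected = 2 * np.pi * k * hop / N   # phase advance if f sat exactly on bin k
dev = (dphi - expected + np.pi) % (2 * np.pi) - np.pi   # wrap to [-pi, pi)
f_est = (k + dev * N / (2 * np.pi * hop)) * fs / N
print(f_est)                         # ~334.7, far finer than the ~43 Hz bin width
```

The deviation of the measured phase advance from the bin-center prediction directly gives the frequency offset within the bin, which is why such a short window can still yield a precise frequency.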
> Still one more question: It's usually recommended to zero-pad by
> something like the sample's own length of zeros. What effect does
> outrageous zeropadding (a million zeros after an 8820-sample snippet)
> have on the FFT's reliability? Would it be an issue that would ruin our
> results? It certainly makes results look more precise, but would that
> just mean adding decimals to a guess or would it really be more accurate?
Zero-padding an FFT is almost exactly the same as interpolating very smoothly between the data points resulting from an un-padded FFT. More zeros mostly adds more smoothness. Any ripples theoretically resolvable by zero-padding interpolation are most likely hidden in quantization noise. Others here have posted that a 3 point parabolic curve fit gives good results at a much lower computational cost. Thanks for bringing an interesting question to comp.dsp. IMHO. YMMV. -- rhn A.T nicholson d.0.t C-o-M
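[Editor's note: the three-point parabolic fit mentioned above, applied to the log magnitudes around the peak bin, looks roughly like this on a synthetic tone (sample length and test frequency are illustrative):]

```python
import numpy as np

fs, N = 44100, 8192
f_true = 439.6
t = np.arange(N) / fs
x = np.hanning(N) * np.sin(2 * np.pi * f_true * t)

mag = np.abs(np.fft.rfft(x))
k = int(np.argmax(mag))                      # peak bin, ~5.4 Hz wide
# fit a parabola through the log magnitudes of the peak and its neighbors
a, b, c = np.log(mag[k - 1]), np.log(mag[k]), np.log(mag[k + 1])
delta = 0.5 * (a - c) / (a - 2 * b + c)      # peak offset in bins, |delta| <= 0.5
f_est = (k + delta) * fs / N
print(f_est)                                 # ~439.6 with no zero-padding at all
```

Three log-magnitude samples and one division replace the million-point zero-padded FFT, which is the computational-cost point made above.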
dieselviulu@yahoo.ca wrote:
> Still one more question: It's usually recommended to zero-pad by
> something like the sample's own length of zeros. What effect does
> outrageous zeropadding (a million zeros after an 8820-sample snippet)
> have on the FFT's reliability? Would it be an issue that would ruin our
> results? It certainly makes results look more precise, but would that
> just mean adding decimals to a guess or would it really be more accurate?
I just tried this on some synthetic sinusoids: zero-padded 8192 windowed samples up to an FFT vector about a million samples long, which did give me better results. But the much longer FFT is a lot slower; and the frequency accuracy didn't seem any better than those produced by phase vocoder analysis on successive and much shorter 1024 sample long FFT frames. The latter phase vocoder method has very low cost beyond that of the FFT, other than that of storing phase information from the previous FFT frame of course. Plus the shorter frames allow clearly better time resolution of any frequency changes. But if you have enough compute power that efficiency doesn't matter, your outrageous zero-padding probably won't do anything that would ruin your results (assuming you can get any interesting results from a large chorus recorded with a single mic). IMHO. YMMV. -- rhn A.T nicholson d.0.t C-o-M