
Analyzing fundamental frequencies from musical signals

Started by dies...@yahoo.ca February 24, 2006
Hello!

I'm a soon (hopefully) graduating music teacher who's quite lost in the
general field of signal processing. I'm trying to write my graduating
work about intonation in amateur choir music. My big question in my
work is "How does an amateur choir fine-tune itself when singing a
cappella (without accompanying instruments)?". In order to reliably
answer that question I need to be able to analyze the tones the people
are singing. For my purposes I have estimated that I'll need an
accuracy of 0.1% of the frequency. The smallest acceptable absolute
error would be about 0.066 Hz, when the bass sings a (very) low C of
about 66 Hz. But that's the extreme situation; I can cope with less.

I'm doing my work in Matlab, and so far I've been simply using the fft.
I've been doing this with a friend who is an engineer, while I'm more
of a musician. I'm very good at mathematics and computing for a music
teacher, but I'm no engineer :). We played with Fourier quite a bit,
and found out that it's not very accurate frequency-wise for this
purpose. Then we made quite a bit of progress by using extreme
zero-padding; on the scale of adding 1e6 zeros after a sample of 8820
points. After that we used the contour function to make our output
plot more accurately readable. It gave accurate results on synthetic
signals, but I'm not at all sure that it would be accurate and
reliable on real audio signals. I have a gut feeling that using that
much zero-padding is not without risk.
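
In Matlab, what we did was roughly the following (a sketch from
memory; the variable names are made up):

    fs = 44100;                         % DAT sample rate
    x  = sig(1:8820);                   % 0.2 s excerpt; 'sig' is the recording
    N  = 2^20;                          % heavy zero-padding, about 1e6 points
    X  = abs(fft(x(:), N));             % fft pads x with zeros up to N points
    f  = (0:N-1)' * fs / N;             % frequency axis of the padded fft
    band = find(f >= 60 & f <= 1000);   % vocal range of interest
    [m, k] = max(X(band));
    f_peak = f(band(k))                 % interpolated peak location, in Hz

(As I understand it, the padding only interpolates the spectrum of the
same 0.2 second window; it doesn't add real resolution, which may be
why I'm nervous about it.)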

We also tried using something that my friend called "autocorrelation"
but it didn't work out. We cancelled that attempt when we got results
that were very nice, clean, accurate and totally incorrect.

The material I'm working with is live recordings from choir rehearsals
and concerts, recorded on DAT with one stereo microphone, sample rate
44100 Hz, 16 bits. The signal is quite noisy, and the number of
frequency peaks that I want to catch varies between 4 and 8. The
frequency range that I want to study is from about 66 Hz (lowest
basses) to 1 kHz (highest sopranos), although the upper limit could be
much higher if it turns out that I need the harmonics too.

We've been using a sample length of 0.2 seconds in the fft. The things
that I don't like about the fft are:

1) When analyzing a synthetic signal consisting of one clean sine wave
at 440 Hz, I get a plot that shows me quite a wide lobe with its peak
at 440 Hz. This means that in the audio signal I cannot tell whether,
for example, all the altos are singing the same tone as they should,
or whether there is diversity [1] within the altos.

2) The fft cannot resolve any changes within the sample length. This
means that in the two cases where a) the basses all sing slightly
different tones, making their voice a band rather than a tone, or b)
the basses sing the same note but their fine tuning slides downwards
within the sample length, I get the same kind of result for both
cases: a wider peak.

3) I have to consider everything related to harmonics manually /
visually. This is not impossible, though, since I'm one of the
mentioned basses and I know the music they are singing. This means that
I'm not looking for an audio-to-notation algorithm, since I do have the
sheet music. If I have to do this with only a "tuned-up fft", I'd
rather lose the harmonics and study only the fundamental tones.

So, is there a better way to do this kind of analysis? I've heard
about the MUSIC algorithm, but I don't know anything about it. My
ideal system would present me a plot of 10 seconds of music with 4-8
curves that represent the fundamental tones of the different voices
(high soprano, lower soprano etc.) and how their frequency changes in
time (so, a plot with time as x and frequency as y). It would also
(not obligatory if it's otherwise reliable and accurate) give me some
kind of power graph of the signal, to see how much diversity there is
within one voice. I have a gut feeling that this *should* be possible,
since we do it with our ears all the time. Also, all the needed
information is in the signal, if I can just get it out, because I can
hear a lot of things in the recording. I know some of the reasons why
fft-based analysis cannot give me a spectrum of a "moment" of music
(sample length approaching zero), but I believe ears can, so it should
be possible somehow.

I would be very glad about suggestions! I know it's a lot to ask, but
please bear in mind that I'm not actually a mathematician, and that
trying out something new in Matlab takes several days before I can say
anything about it... Imagine writing a letter with a dictionary in a
language you know almost nothing about :). So if a light bulb lights up
in your head and you know just the thing that works, and why it works,
I'd love to hear it!

Happy processing!

Erkki Nurmi
Sibelius Academy, Finland

[1] By diversity within a voice I mean a situation where a loud
"leading" alto is singing a note at 300 Hz, and another loud alto at
the other end of the row of altos is singing 295 Hz. All the other
altos sing (with varying amplitudes) something between those two
frequencies.
This is not a rare situation.

Wow, that's a lot of technical jargon for a "graduating music teacher"!

I must disappoint you: your problem is intractable. While tracking the
fundamental frequency of one voice is a solvable (but not easy)
problem, tracking the separate F0s of many voices in a choir recording
is just impossible, at least with the present state of the art.
Btw, FFT is NOT the right tool for reliably tracking even one voice,
as you have already noticed...
And why didn't autocorrelation work?

The most general and robust solution for tracking the F0 of one signal
is described in US Patent Application 20030088401
(www.uspto.gov/patft).
The ICASSP paper and Matlab demo are available from
http://www.soundmathtech.com/pitch

Also, our auditory system is vastly more complex than just a bunch of
simple filters and FFTs...
As somebody (Georg von Bekesy?) once said:
"Dead cats and Fourier transforms have harmed hearing science more than
anything else".

The only practical suggestion I can make is this:
If you really want to study this problem in detail, you need to
provide a separate mic for each singer in the choir (and do your own
recording session).

dieselviulu@yahoo.ca wrote:
> I'm a soon (hopefully) graduating music teacher who's quite lost in
> the general field of signal processing. I'm trying to write my
> graduating work about intonation in amateur choir music. My big
> question in my work is "How does an amateur choir fine-tune itself
> when singing a cappella (without accompanying instruments)?". In
> order to reliably answer that question I need to be able to analyze
> the tones the people are singing.
are you micing and recording each voice separately? you're not trying to estimate pitches of vocals that are all mixed together, are you? ...
> We also tried using something that my friend called "autocorrelation"
> but it didn't work out. We cancelled that attempt when we got results
> that were very nice, clean, accurate and totally incorrect.
then you didn't do it right. consider a closely related method that is
also very old: the average magnitude difference function (AMDF) or
average squared difference function (ASDF). as with any method, you
may have to worry about very low amplitude sub-harmonics, which might
make a 260 Hz quasi-periodic waveform appear to be a 130 Hz waveform.
other than these "octave errors" (and there are ways to recognize and
mitigate them), these correlation methods should work fine.

r b-j
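
For illustration, the ASDF part of this could be coded like the
following Matlab sketch (made-up names; it assumes a short, roughly
monophonic excerpt x at 44.1 kHz):

    fs   = 44100;
    x    = x(:) - mean(x);                 % remove any DC offset
    lags = round(fs/1000) : round(fs/60);  % candidate periods, 1 kHz down to 60 Hz
    d    = zeros(size(lags));
    for i = 1:length(lags)
        L    = lags(i);
        e    = x(1+L:end) - x(1:end-L);    % signal minus itself one period later
        d(i) = mean(e.^2);                 % average squared difference
    end
    [m, i0] = min(d);                      % best-matching lag
    f0 = fs / lags(i0)                     % pitch estimate; watch for octave errors

A parabolic fit around the minimum of d could refine the lag below
one-sample resolution.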
On Fri, 24 Feb 2006 03:56:51 -0800, dieselviulu@yahoo.ca wrote:

> Happy processing!
You seem to want to determine how a spectrum changes over time. You
want to distinguish between frequencies that are 0.1 Hz apart in the
middle range, and see these changes on a moment to moment basis. I'd
say you want to break your samples into 'small' blocks, window the
blocks using something like a Hamming window to reduce the spectral
leakage from the discontinuities at the ends of the blocks, and plot
the output on a 3d plot, where block number is x, frequency is y, and
coefficient amplitude is z. In order to get 0.1 Hz bin spacing, you
need to zero-pad the small blocks to about 441k samples (44100/0.1)
before you do your FFT. If you used 441-sample blocks, you could get
20 intervals out of your 0.2 s. If things are changing over your
0.2 s, you might easily see it happening with this scheme.

--
Regards,
Bob Monsen

"Everything should be made as simple as possible, but not simpler."
  - Albert Einstein (1879 - 1955)
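One way to realize that recipe in Matlab (a rough sketch; the sizes
are illustrative and x is assumed to be a column vector holding the
excerpt):

    fs  = 44100;  blk = 441;  Nfft = 2^19;   % ~0.08 Hz bin spacing after padding
    w   = hamming(blk);                      % window each block before the fft
    nb  = floor(length(x)/blk);              % e.g. 20 blocks from a 0.2 s excerpt
    f   = (0:Nfft/2-1)' * fs / Nfft;
    band = find(f >= 60 & f <= 1000);        % keep only the vocal range
    S   = zeros(length(band), nb);
    for b = 1:nb
        seg     = x((b-1)*blk + (1:blk)) .* w;
        X       = abs(fft(seg, Nfft));       % zero-padded fft of one block
        S(:, b) = X(band);
    end
    surf(1:nb, f(band), S, 'EdgeColor', 'none');  % block vs frequency vs amplitude
    xlabel('block number'), ylabel('Hz');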
This is about analyzing fine detail in pitch and pitch changes
over time, given that you already know approximately what note
is being sung or played, and that any interfering tones are
at least a minor third or more away.

Note that the strongest frequency might be an overtone of
the fundamental pitch, so you might have to divide down the
strongest frequency to find the pitch relative to the notes
on the score sheet.

As you found, zero-padding and using a long fft, although a very
accurate method of interpolating frequency, will not show fine
detail in the frequency envelope.  However frequency is the
derivative of phase.  So what I might try is a technique from
phase vocoding.  Use overlapped successive short fft's and
compare the phase changes in the nearest bin of interest with
what would be the phase change represented by the overlap
offset.  Plot that phase difference.  The slope of the plot will
represent the frequency offset from the fft bin center, and any
curvature in the plot will represent a change in frequency.

This could work with fft windows as short as maybe a dozen
cycles or less of the dominant frequency (which itself may be
an overtone of the fundamental pitch), so you can get much
better time resolution.  For 330 Hz, maybe try 75% overlapped
windows as short as maybe 1024 samples at 44.1 kHz.

This fft phase technique differs from autocorrelation in that
autocorrelation requires some interpolation to find the phase
to some given resolution, if finer than that of one sample step.
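
In Matlab, a bare-bones version of this might look like the sketch
below (untested; it assumes a column vector x holding a roughly
monophonic excerpt, and tracks only the strongest bin):

    fs  = 44100;  N = 1024;  hop = N/4;       % 75% overlapped windows
    w   = hanning(N);
    nb  = floor((length(x) - N)/hop) + 1;     % number of frames
    ph  = zeros(nb, 1);
    k   = 0;
    for b = 1:nb
        X = fft(x((b-1)*hop + (1:N)) .* w);
        if b == 1                             % pick the bin of interest once
            [m, k] = max(abs(X(2:N/2)));
            k = k + 1;
        end
        ph(b) = angle(X(k));                  % phase of that bin, per frame
    end
    expect = 2*pi*(k-1)*hop/N;                % phase advance of the bin center per hop
    d      = mod(diff(ph) - expect + pi, 2*pi) - pi;   % wrapped phase deviation
    freq   = (k-1)*fs/N + d*fs/(2*pi*hop);    % frequency estimate per hop
    plot((1:nb-1)*hop/fs, freq), xlabel('time / s'), ylabel('Hz')

A trend in freq, or curvature in the accumulated phase deviation,
indicates the pitch sliding within the excerpt.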

As for multiple voices at differing frequencies, I recommend
multiple microphones and a multi-channel recorder.

IMHO. YMMV.

--
Ron Nicholson  rhn A.T nicholson d.0.t C-o-M

------Original Message------
dieselviulu@yahoo.ca wrote:
> [quoted text muted]
robert bristow-johnson wrote:
> dieselviulu@yahoo.ca wrote:
> > We also tried using something that my friend called "autocorrelation"
> > but it didn't work out. We cancelled that attempt when we got results
> > that were very nice, clean, accurate and totally incorrect.
>
> then you didn't do it right.
Or, choral music will usually have "unwanted noise" (e.g. the other
voices filling out some chord) which is strongly correlated to some
submultiple of the pitch of interest, rather than being uncorrelated
white noise. So simple autocorrelation, without some sort of
prefiltering, might not be the best way to reject this type of "noise".

IMHO. YMMV.
--
rhn A.T nicholson d.0.t C-o-M
dieselviulu@yahoo.ca wrote:

> Hello!
>
> I'm a soon (hopefully) graduating music teacher who's quite lost in
> the general field of signal processing. I'm trying to write my
> graduating work about intonation in amateur choir music. My big
> question in my work is "How does an amateur choir fine-tune itself
> when singing a cappella (without accompanying instruments)?".
...

This is indeed a very interesting question! Its relevance extends
beyond singers and choirs, even to instrumental performance. What
follows may only be of marginal interest for dsp people, but anyway:

Excellent though Matlab may be for audio analysis, you may find that
the advanced tools designed specifically for analysing musical audio
suit your purposes better. For example, the most comprehensive system
around at the moment seems to be the CLAM suite of tools (all GPL with
sources, but full binary installer-based packages are now available)
from University Pompeu Fabra:

http://www.iua.upf.es/mtg/clam/

The reason I suggest this is that for truly accurate pitch analysis
you need what is called partial or peak tracking, which starts with a
pvoc-style FFT analysis (overlapping frames), then searches for
spectral peaks in each frame and derives from these a set of
amplitude/frequency tracks. The original system for doing this is
called "McAulay-Quatieri" (MQ) analysis, after the authors who first
presented the idea. Methods of peak detection and tracking have
evolved over the years, but the principle is the same.

You will need to define your analysis pitch accuracy in terms of Cents
(1200 to the octave); the difference between a perfect 5th and a
12-tone equal-tempered 5th is only ~2 Cents, which is not a lot! But
if you can get an accuracy of, say, 10 Cents (0.1 of a tempered
semitone) you will still be sure of useful data. Less than that and
you will start to miss relevant features.

You will want to search for a number of significant features in the
analysis (which, as others have pointed out, really does need to be
per voice - ideally per individual singer). In particular, singers
"hunt" for the required pitch by means of tiny (micro) adjustments
after commencing the note, usually from below. If you can detect these
(usually only a fraction of a second), you should be able to identify
which voices are leading, and which are following. It also depends on
whether the sung note is "mobile" or not (see below). You can in
general expect to find small wobbles at the start of many notes as
singers hunt for the pitch that "sounds right" (this diminishes the
better the pitch memory, so the degree of leading wobble can be a
measure of choir competence). Violinists are famous for tuning strings
by approaching upwards with the peg from well below the note. It is
much like trying to get a camera lens in focus - the sensation is
directly comparable.

Assuming by "amateur" you mean untrained or inexperienced singers, the
primary issue is that of short/medium-term pitch memory. Assuming
tonal harmony-based music, musicians do not "simply" employ relative
pitch knowledge (sequential) to pitch notes in sequence, but also rely
on solfege, to ensure for example that the pitch of tonic notes is
maintained. So one analysis to try is to identify all instances of the
tonic note (for music that is predominantly or totally in one key),
and detect the degree of deviation from the starting pitch, and any
trend over time. Someone who claims to be "tone-deaf" is usually
saying (not necessarily correctly) that they have little or no pitch
memory. I am excluding from this those who have perfect pitch - though
I will note that this does not of itself guarantee accurate tuning.

Untrained singers typically use insufficient effort when singing
ascending intervals, so that a rising interval may be too narrow
(upper note flat), but the same descending interval will be too large
(lower note flat). Thus, for example, simply singing a minor third up
and down again can result in the third note (which with good pitch
memory should be a repeat of the first) ending up lower in pitch. The
speed of the sequence clearly tests pitch memory. This is a key reason
why choirs can drop in pitch progressively during a performance. Music
in minor keys is the most likely to demonstrate this problem, as it is
not usually appreciated just how wide a minor third needs to be.

The next stage in analysis is to relate pitch selection to the almost
black-art dance between just intonation ("pure" thirds, interval = 5/4
for the major third, or ~386 Cents, where 400 Cents = the equal
tempered interval) and Pythagorean tuning based on ascending perfect
fifths (major third much wider, = 81/64 or ~408 Cents). For many
musicians this is done purely instinctually (and thus not always
reliably!). In the major scale, the non-"perfect" intervals (2nd, 3rd,
6th, 7th) are mobile, as I call them. I therefore use the notion of a
distinction between a "harmonic" and a "melodic" major scale. In the
harmonic scale, the mobile notes are tuned "just": pure (narrow)
thirds; the leading note is a pure 3rd above the dominant. In a
melodic scale, wider Pythagorean intervals are sung: a wide third, and
especially a deliberately sharper 7th "leading-note" ascending to the
tonic above. This can be done consciously for expressive effect. Those
holding harmony notes (especially the third in the triad) will tend to
tune "pure", whereas those singing the melody will tend to favour
wider intervals, Pythagorean-style.

The catch is that (by analogy with the melodic minor), singers may
sing sharp 3rds, 6ths and 7ths ascending (i.e. Pythagorean), and the
same notes flatter (closer to "just") on the way down, while (the
crucial point) preserving the pitch of the tonic, 4ths and 5ths. This
clearly requires solid tonic pitch memory. If this goes wrong, again
the tendency is usually for the tuning to go flat over time; but
occasionally singers can hit a really sharp note that really throws
things!

The point of all this is to avoid the temptation to use equal
temperament as the measure of tuning success. No competent musician
uses that if they have any choice in the matter! Each choir, if well
trained, develops a sort of tuning style, so that everyone is used to
how the thirds, say, are tuned. I can imagine you might be able to
evaluate a choir technically by accumulating data on interval
preservation over time, and finding the statistical deviation from a
norm - the narrower the peak of the curve, the more consistent the
pitching, and therefore the "better" the choir. But because of the
legitimate mobility of some intervals, the data must be weighted
according to interval. Choirs who sing intervals haphazardly, up and
down, should by contrast show a wider deviation. Still, so long as the
tonic pitch remains stable, the choir has done pretty well.

There are many detailed issues that can be investigated, such as the
width of semitones ascending and descending (7th/tonic, 4th/3rd); they
can be varied expressively in conjunction with strong tonic memory, or
they can become the surest source of errors. And of course it helps
enormously if the singers do not use vibrato.

If you have access to a barbershop quartet, the tuning styles they use
are most distinctive and, from what I can tell (I am no cognoscente of
barbershop, though I enjoy hearing it from time to time),
predominantly favour "pure" just intervals. So I would guess that an
"amateur" quartet could be a most fruitful subject for study.

A cappella choral singing is an extraordinary socio-acoustic
phenomenon (if that term does not already exist, I have just invented
it), as tuning is such a collaborative, often leaderless and
unconscious process.

Richard Dobson
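
For reference, the Cent figures Richard quotes all come from the
ratio-to-cents formula, cents = 1200*log2(ratio). A one-line Matlab
check (nothing here beyond the numbers already given above):

    cents = @(r) 1200*log2(r);    % interval ratio -> cents
    cents(5/4)                    % just major third:            ~386.3
    cents(81/64)                  % Pythagorean major third:     ~407.8
    cents((3/2) / 2^(7/12))       % pure vs equal-tempered 5th:  ~+2.0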
Richard Dobson wrote:
> dieselviulu@yahoo.ca wrote:
>> [quoted text muted]
>
> Excellent though Matlab may be for audio analysis, you may find that
> the advanced tools designed specifically for analysing musical audio
> suit your purposes better. For example, the most comprehensive system
> around at the moment seems to be the CLAM suite of tools (all GPL
> with sources, but full binary installer-based packages are now
> available) from University Pompeu Fabra:
>
> http://www.iua.upf.es/mtg/clam/
> [snip]
This looks as if it might be useful in analyzing speech, with a view
towards working on speech recognition problems. Your overview
(snipped) and the sample screen shots seem to match how I visualize
the problem. Will download later, have to go to work now ;[
Ron N. wrote:
> robert bristow-johnson wrote:
> > dieselviulu@yahoo.ca wrote:
> > > We also tried using something that my friend called
> > > "autocorrelation" but it didn't work out. We cancelled that
> > > attempt when we got results that were very nice, clean, accurate
> > > and totally incorrect.
> >
> > then you didn't do it right.
>
> Or choral music will usually have "unwanted noise" (e.g. the other
> voices filling out some chord)
i would call that "interference" instead of "noise", and it's one
reason i asked the OP if they are micing each singer separately. i
know that source separation is not a solved issue, and that normally a
pitch detector (no matter what PD algorithm) works on a monophonic
tone (and outputs a single pitch parameter), not on polyphonic input.
these "FFT methods" will suffer the same problems of separating
frequency components coming from different voices, and if you're not
careful, they have other problems (such as getting the pitch wrong
when there is a "missing fundamental", which happens for some musical
tones).
> which is strongly correlated to some submultiple of the pitch of
> interest, rather than uncorrelated white noise. So simple
> autocorrelation, without some sort of prefiltering, might not be the
> best way to reject this type of "noise".
i'm sorry Ron, but there isn't really a (blind) pre-filter that will
make this work with polyphonic input where the singers' pitches are in
the same ballpark (within an octave or two). i suppose you could try a
non-blind comb filtering that knocks out tones that the pitch detector
is not focusing on, but that becomes a circular problem (you need to
know the pitch a priori to tune the comb filter), and it still does
not solve the problem of initially determining the pitch of the first
voice from the cacophony of all of the other voices.

to do this right, you have to mic each singer separately and record
them simultaneously.

r b-j
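
For concreteness, the non-blind comb filter described above might look
like this minimal Matlab sketch; the competing pitch p has to be
assumed known in advance, which is exactly the circularity pointed out:

    fs = 44100;
    p  = 220;                                  % competing pitch -- assumed known!
    L  = round(fs/p);                          % one period, in samples
    y  = filter([1, zeros(1, L-1), -1], 1, x); % y[n] = x[n] - x[n-L]

This notches p and all of its harmonics (and DC), but it will also
take out any partials of the wanted voice that happen to land on those
same harmonics.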
On Sat, 25 Feb 2006 06:59:14 -0800, robert bristow-johnson wrote:

> Ron N. wrote:
>> [quoted text muted]
>
> [quoted text muted]
>
> to do this right, you have to mic each singer separately and record
> them simultaneously.
>
> r b-j
The OP gets to select the snippet of audio that he is interested in,
so he can select a passage in which

A) the singing is just beginning
B) there are no other voices, other than the section he is interested in
C) there is only a single note, which the choir section is trying to hit

He wants to find out how the section converges (or not) on the note.
So, not being an expert, perhaps I'm missing some crucial point here,
but why can't he just use an overlapping (or maybe even
non-overlapping) set of sequential FFTs to track the frequency changes?

--
Regards,
Bob Monsen

"We should take care not to make the intellect our god; it has, of
course, powerful muscles, but no personality."
  - Albert Einstein (1879 - 1955)