DSPRelated.com
Forums

Analyzing fundamental frequencies from musical signals

Started by dies...@yahoo.ca February 24, 2006
robert bristow-johnson wrote:
> i'm sorry Ron, but there isn't really a (blind) pre-filter that will
> make this work with polyphonic input where the singers' pitches are in
> the same ballpark (within an octave or two). i suppose a non-blind
> comb filtering that knocks out tones that the pitch detector is not
> focusing on, but that either becomes a circular problem (knowing the
> pitch a priori to tune the comb filter) and still does not solve the
> problem of initially determining the pitch of the first voice from the
> cacophony of all of the other voices.
It sounds like the OP is a music teacher who is present at the time of the recordings. Therefore he probably has access to a printed score sheet, and thus can calculate the approximate pitch represented by the note that a particular part is supposed to be singing at a particular point in the score (assuming the choir isn't way out of tune of course). That will solve the a priori problem you mention.

If the other parts aren't singing octaves, why wouldn't a notch filter over the other parts, or a bandpass filter around the dominant overtone related to the pitch of interest, help reject interference to an autocorrelation?

IMHO. YMMV. -- rhn A.T nicholson d.0.t C-o-M
Bob Monsen wrote:
> So, not being an expert, perhaps I'm missing some crucial point here,
> but why can't he just use an overlapping (or maybe even non-overlapping)
> set of sequential FFTs to track the frequency changes?
It looks like it's the standard problem of resolution in frequency versus resolution in time. To get more resolution in frequency from an FFT requires a longer FFT window, which, without zero padding, results in less resolution in time, and/or might even require a buffer longer than the sound of interest, which would introduce outside time-domain interference. Autocorrelation can be done with a sample window as short as maybe 2 periods of the pitch of interest. Phase vocoding is similar to an autocorrelation of the signal after prefiltering by a form of one-bin-wide bandpass filter, which helps a bit with noise and interference rejection.

In music, the strongest frequency present might be an overtone of the musical pitch. Autocorrelation can help determine if this is the situation. But if one already knows the approximate pitch, one can measure the dominant frequency and divide that down to get more precise pitch information.

IMHO. YMMV. -- rhn A.T nicholson d.0.t C-o-M
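[Editor's note: the autocorrelation idea above can be sketched in a few lines. This is a minimal illustration on a synthetic tone, in NumPy rather than the Matlab the participants are using; the window length and lag search range are illustrative choices, not values from the thread.]

```python
import numpy as np

def autocorr_pitch(x, fs, fmin=80.0, fmax=1000.0):
    """Estimate the fundamental by peak-picking the autocorrelation
    over a plausible range of pitch periods."""
    x = x - np.mean(x)
    r = np.correlate(x, x, mode="full")[len(x) - 1:]   # lags 0..N-1
    lo, hi = int(fs / fmax), int(fs / fmin)            # lag search range
    lag = lo + int(np.argmax(r[lo:hi]))
    return fs / lag

fs = 44100
t = np.arange(int(0.05 * fs)) / fs        # a 50 ms snippet, a few periods
# a 110 Hz tone whose 3rd harmonic is stronger than its fundamental
x = 0.5 * np.sin(2 * np.pi * 110 * t) + 1.0 * np.sin(2 * np.pi * 330 * t)
print(autocorr_pitch(x, fs))              # close to 110, not 330
```

Note that the peak lands on the full pitch period even though most of the energy is in the overtone, which is Ron's point about autocorrelation resolving the overtone-versus-fundamental ambiguity.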
Bob Monsen wrote:
> On Sat, 25 Feb 2006 06:59:14 -0800, robert bristow-johnson wrote:
> ...
> > these "FFT methods" will suffer the same problems of separating
> > frequency components coming from different voices, and if you're not
> > careful, they have other problems (such as getting the pitch wrong when
> > there is a "missing fundamental" which happens for some musical tones).
>
> The OP gets to select the snippet of audio that he is interested in, so he
> can select a passage in which
>
> A) the singing is just beginning
> B) there are no other voices, other than the section he is interested in
> C) there is only a single note, which the choir section is trying to hit
>
> He wants to find out how the section converges (or not) on the note.
>
> So, not being an expert, perhaps I'm missing some crucial point here,
> but why can't he just use an overlapping (or maybe even non-overlapping)
> set of sequential FFTs to track the frequency changes?
which frequency? it's unlikely that there is a single "bump" in the spectra to track. the "fundamental frequency" that the OP wants to track is one frequency of many in a tone that is not a single sinusoid.

i'm not saying that *you're* doing this, Bob, but just for the purpose of clarity, i want to point out that IMO the FFT is appealed to a lot to solve whatever frequency or pitch problem crops up, and it's a tool or weapon that does little effectively unless it's wielded well. so you take a little snippet of audio (maybe you window it) and pass that to an FFT, you get the complex output (bit-reversed back into order if your FFT needs that), and compute the magnitude or magnitude-squared (or log of it) and look at it. fine, now what are you (or the automated process that you're writing code for) going to do with it?

if you pick the frequency of the bin of the maximum amplitude (ignoring interpolation issues for the moment), that might be useful, but it might be a harmonic, not the fundamental. especially for vocals, since the singer's mouth is a highly resonant chamber: when they sing "oooooo" and move the tongue around, the most resonant frequency is changing even when the pitch (or fundamental frequency) is not.

for a single monophonic note, a legitimate method some people use to take care of this is to sample the spectrum magnitude at equally spaced points (as possible harmonics) and add those up, and the spacing that results in the maximum amplitude or energy determines the resultant fundamental frequency. but think about that for a moment: that is like multiplying the spectrum by the frequency response of some kinda comb filter. so now think about what this comb filter would be like in the time domain (it's the input added to a delayed version of the input). if you toss in a little scaling (1/2) and flip the thing over (subtract it from 1), you have effectively something that looks a lot like the AMDF or ASDF. sometimes the simplest implementation is, well, the simplest.

r b-j
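[Editor's note: the ASDF that r b-j lands on above is only a few lines of code. Here is a minimal sketch (function name and all parameters are illustrative, not from the thread), run on a synthetic quasi-periodic tone whose strongest partial is an overtone:]

```python
import numpy as np

def asdf_pitch(x, fs, fmin=80.0, fmax=1000.0):
    """Average Squared Difference Function: pick the lag tau that
    minimizes mean((x[t] - x[t+tau])^2) over a plausible pitch range."""
    lo, hi = int(fs / fmax), int(fs / fmin)
    n = len(x) - hi                       # samples compared at every lag
    d = [np.mean((x[:n] - x[tau:tau + n]) ** 2) for tau in range(lo, hi + 1)]
    return fs / (lo + int(np.argmin(d)))

fs = 44100
t = np.arange(int(0.05 * fs)) / fs
# a 110 Hz tone whose strongest partial is the 3rd harmonic
x = (0.4 * np.sin(2 * np.pi * 110 * t)
     + 0.3 * np.sin(2 * np.pi * 220 * t)
     + 1.0 * np.sin(2 * np.pi * 330 * t))
print(asdf_pitch(x, fs))                  # picks the fundamental, ~110 Hz
```

The difference function is near zero only at the full pitch period, where *all* the harmonics cancel at once, which is exactly the harmonic-comb behavior described in the spectral-domain version.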
Ron N. wrote:
> It sounds like the OP is a music teacher who is present at the
> time of the recordings. Therefore he probably has access to
> a printed score sheet, and thus can calculate the approximate
> pitch represented by the note that a particular part is supposed
> to be singing at a particular point in the score (assuming the choir
> isn't way out of tune of course). That will solve the a priori
> problem you mention.
i would think that the comb filter would need to be precisely tuned (A = 440 Hz is not good enough when the person is actually singing 443 Hz). so a real pitch tracker (that might be informed by the note they're *supposed* to be singing, or perhaps not) that results in a pretty precise value (precision is not necessarily accuracy, but we would like to hope that it's also accurate), to define the precise delay (using fractional-sample interpolation) of the comb filter, might be needed to really knock out that tone. so we need to know the precise frequency of the tone so we can tune a comb filter to extract it so we can accurately measure the frequency of it.

now if they're separately miked and recorded on separate tracks, each pitch detector applied to each track can be assured that it's looking at a single tone or voice with no need to separate them. the *only* frequency components (of significant energy) that it's looking at are the harmonics of a *single* fundamental. it is a quasi-periodic function and pitch trackers can work pretty well on that kind of input.
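[Editor's note: to make the tuning-precision point concrete, here is a rough sketch of a feedforward comb with a fractional delay, on two synthetic "voices". Plain linear interpolation stands in for a proper fractional-delay filter, and all frequencies and amplitudes are illustrative:]

```python
import numpy as np

fs = 44100
f0 = 443.0                                # the pitch actually sung, not 440
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * f0 * t) + 0.7 * np.sin(2 * np.pi * 331.0 * t)

# y[n] = (x[n] - x[n - D]) / 2 with D = fs/f0 nulls f0 and all of its
# harmonics.  D is fractional, so approximate x[n - D] by linear
# interpolation between x[n - k] and x[n - k - 1].
D = fs / f0
k, frac = int(D), D - int(D)
xi = np.concatenate([np.zeros(k + 1), x])
delayed = (1 - frac) * xi[1:1 + len(x)] + frac * xi[:len(x)]
y = 0.5 * (x - delayed)

# measure how much of the f0 "voice" survives (skipping filter start-up)
seg = slice(k + 1, len(x))
c, s = np.cos(2 * np.pi * f0 * t[seg]), np.sin(2 * np.pi * f0 * t[seg])
amp_in = 2 * np.hypot(np.mean(x[seg] * c), np.mean(x[seg] * s))
amp_out = 2 * np.hypot(np.mean(y[seg] * c), np.mean(y[seg] * s))
print(amp_in, amp_out)        # the 443 Hz voice is strongly attenuated
```

Retuning `f0` to 440 instead of the true 443 makes the null miss the tone, which is the point about needing the precise frequency before the comb can be tuned.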
> If the other parts aren't singing octaves, why wouldn't a notch filter
> over the other parts,
you have to know what those other parts are, a priori, to tune the notch filter(s).
> or a bandpass filter around the dominant
> overtone related to the pitch of interest, help reject interference
> to an autocorrelation?
well, pitch shifting using the phase vocoder is pretty blind to the actual pitches in the input (which is one reason it works for complete mixes), but time-domain pitch shifting of multiple voices, polyphonic pitch extraction, and source or instrument separation are still rather unsolved problems in the state of the art of audio DSP at the moment.

i remember a while ago Rob Maher did a paper on separating duet recordings. that algorithm only had to worry about *two* separate tones (and the interlacing and interaction of two sets of harmonics) and even so, the problem was a female canine (it didn't always work). with more voices (than 2), i imagine it would be a hideous monster.

r b-j
Hello!

I'm really sorry this reply comes in so late, I did not realise this
newsgroup would move so fast. Well, better late than never.

Ron N. wrote:
> Bob Monsen wrote:
> > So, not being an expert, perhaps I'm missing some crucial point here,
> > but why can't he just use an overlapping (or maybe even non-overlapping)
> > set of sequential FFTs to track the frequency changes?
>
> It looks like it's the standard problem of resolution in frequency
> versus resolution in time. To get more resolution in frequency from
> an FFT requires a longer FFT window, which, without zero padding,
> results in less resolution in time, and/or might even require a
> buffer longer than the sound of interest which would introduce
> outside time-domain interference.
Yes, that's exactly what I meant. If I want more accuracy in frequency, I need a longer window, and resolution in time decreases.

A few points of this project that I could not make clear enough:

1) Yes, I have the whole choir on one track. Or actually two, but it being in stereo does not help at all. So this is a real challenge. I have already recorded what I'm about to record (over a year ago), so now I'm just wondering how to get the best results out of it. I did ponder quite a bit about whether to use a multi-track recording system, but then I decided against it. As Richard Dobson wonderfully said, "a capella choral singing is an extra-ordinary socio-acoustic phenomenon". So I thought that if I took the choir into a studio and tried to record voices separately, I would lose the basic essence of the phenomenon I'm trying to study. I mean the situation would be so far from a real choir-singing event that it would not be worth very much to me. I also thought about things being easier if I had 4 or 8 singers instead of 35, but that kind of "barbershop" music is not what I'm interested in. I'm fascinated by the way 35 singers can work like a single instrument without anyone [inside] ever thinking about how 35 throats, 70 ears and 35 brains can possibly do anything together.

2) Our FFT method so far consists of taking a series of 0.2 sec slices (8820 samples) at 0.1 sec intervals (50% overlap?), running them through a Hanning window, zero-padding with a million zeros and plotting the result with frequency and amplitude. We have also made 3D plottings of all the slices (time, freq, amp) and it works nicely, but now I'm more interested in getting the actual data out of there.

3) The autocorrelation thingy we tried didn't work because it seemed to be highly vulnerable to noise.
We knew that it was only supposed to work if I had voices on separate tracks, but we thought we might check if we could use very heavy filtering and isolate the fundamentals that we are trying to measure (measure, not "find", since I know what's being sung there, so I know with 5% accuracy where those fundamentals are).

So before trying any filtering we built an autocorrelation system and tested it with a 440 Hz Matlab-made sine wave. It was correct and accurate. Then we thought to test an extremely noisy scenario: we made a waveform that was the sum of 400 Hz and 500 Hz with equal amplitudes and fed it to the beast. As a result we got a very accurate, nice and clean 441.xxx Hz. I'm not sure of the actual result, but something like that. What scared us off was the fact that the result was as clean as it would have been if there had been a single noiseless wave of that frequency. All the noise was just gone. Nothing suggested that anything was wrong with the result. I know our noise example was an extreme one, but it revealed that power in frequencies other than the one being measured will pull the result in the direction of the noise, and that's the last thing we might want. Is this how autocorrelation is supposed to work or did we do something wrong?

I have some real noise issues there also, because it was not a studio recording, not least of which is a DC hum in the left channel, but I think I'll handle that one by filtering out everything below 55 Hz (there's nothing I'm interested in down there).
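[Editor's note: the failure mode Erkki describes is easy to reproduce. A stand-alone sketch (signal length, search range, and exact result are illustrative; the precise frequency reported depends on the search range and windowing, which may explain the 441.xxx figure above):]

```python
import numpy as np

fs = 44100
t = np.arange(int(0.2 * fs)) / fs           # 0.2 s test signal
# two equal-amplitude tones -- the "extremely noisy" test from the post
x = np.sin(2 * np.pi * 400 * t) + np.sin(2 * np.pi * 500 * t)

# peak-pick the autocorrelation near the expected pitch range
r = np.correlate(x, x, mode="full")[len(x) - 1:]
lo, hi = int(fs / 550), int(fs / 350)
lag = lo + int(np.argmax(r[lo:hi]))
f_est = fs / lag
print(f_est)   # one clean, confident number between 400 and 500 -- neither tone
```

Nothing in the output hints that the input was two tones: the autocorrelation peak simply sits at a compromise lag between the two periods, exactly the silent failure described above.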
> In music, the strongest frequency present might be an overtone of
> the musical pitch. Autocorrelation can help determine if this is
> the situation. But if one already knows the approximate pitch, one
> can measure the dominant frequency and divide that down to get more
> precise pitch information.
I wonder how often that situation occurs. I'm not sure if you mean the situation where a person singing alone (or many people singing the same note) produces a sound where an overtone is dominant, or a situation where that comes from different voices' overtones boosting each other, for example two people singing a fifth apart at 200 and 300 Hz with a common overtone of 600 Hz being dominant.

The situation I'm worried about is when an upper voice is singing a pitch that is simultaneously an overtone of a lower voice. Let's say basses sing 110 Hz (A) and altos sing 330 Hz (the E an octave and a fifth higher). And let's say tuning is not perfect. Now there would be a peak at 330 Hz that's the sum of altos and basses. How much would the basses' overtone affect the peak at 330 Hz? Let's say that basses sing 108 Hz, making their 3rd partial 324 Hz. If I tried to study it with a non-zeropadded FFT, I would get such a wide "hill" from the altos' fundamental (330 Hz) that there would be quite a lot of power shown also at 324 Hz. Now if we add the power from the basses to that, I believe 324 Hz would become the peak instead. I'm hoping to tackle that by zeropadding (making the peaks steeper), but I'm not sure if it works. And it would be nasty towards the singers to assume that "if the upper pitch is in perfect harmony, it must be an overtone" :>

Ron Nicholson wrote:
> As you found, zero-padding and using a long fft, although a very
> accurate method of interpolating frequency, will not show fine
> detail in the frequency envelope. However frequency is the
> derivative of phase. So what I might try is a technique from
> phase vocoding. Use overlapped successive short fft's and
> compare the phase changes in the nearest bin of interest with
> what would be the phase change represented by the overlap
> offset. Plot that phase difference. The slope of the plot will
> represent the frequency offset from the fft bin center, and any
> curvature in the plot will represent a change in frequency.
>
> This could work with fft windows as short as maybe a dozen
> cycles or less of the dominant frequency (which itself may be
> an overtone of the fundamental pitch), so you can get much
> better time resolution. For 330 Hz, maybe try 75% overlapped
> windows as short as maybe 1024 samples of 44.1 KHz.
Wow.. would this also work in my case where all the voices are on the same track, or for individual voices only? I have not yet figured out what you mean by this but I will, and anyway understanding the answer is the asker's responsibility. One key question: What do you mean by "nearest bin of interest"? (or rather, what does a "bin" mean in this context?) A stupid question, perhaps, but I only know as many things about signal processing as I have come across in this project, and also it was my friend who wrote the actual code, although I know what the code does. I can read it but not yet write :)

Richard Dobson wrote:
> Excellent though Matlab may be for audio analysis, you may find that the
> advanced tools designed specifically for analysing musical audio may suit your
> purposes better. For example, the most comprehensive system around at the moment
> seems to be the CLAM suite of tools (all GPL with sources, but full binary
> installer-based packages are now available) from University Pompeu Fabra:
Thanks a lot!! I will check that out thoroughly. And thank you for the insights on more musical matters! I didn't realise that singers approach the tones from below, although I have practical experience in the matter. I mean, when you mentioned it I realised that's just what I do.

I have a pianist background so I tend to consider harmony a "vertical" thing. When I sing in a choir, I don't think very much about successive intervals being pure; instead I compare my voice with other simultaneously occurring voices. I surely don't assume that equal tuning would be the basis for choirs; instead I'd like to really find out what is the amateur reality of tuning. By researching the practice I'm trying to find out what is the hidden ideal of tuning that they are trying to achieve, if there is one.

There is one paradox that I'm particularly interested in, and this goes way off the topic of DSP. Let's say a choir sings in C major the following typical cadence with very slow chords: C, F, dm7, G7, C, or in other words I - IV - II7 - V7 - I. The F major is tuned so that F and C make a perfect 5th and the A is a pure major 3rd above F (a 4:5:6 major chord). In dm7 the D is introduced - it will go to a perfect 5th below the A while the other notes remain untouched. Now we go to G7, where the singers of the D note will hold their pitch from the previous chord. G is tuned by the D, and in doing so it will become lower than it was in the beginning, by a syntonic comma (80:81). I know there are solutions to this, but what I think is weird about this phenomenon is that everybody is singing in perfect harmony and our tonic is falling. I'd like to know how this thing is dealt with "in real life".

I must sadly confess (to Ron N. and Robert Bristow-Johnson) that I actually didn't understand (yet) much of your later conversation, and therefore I fail at trying to comment on it, although I would very much like to.
Still one more question: It's usually recommended to zero-pad by something like the sample's own length of zeros. What effect does outrageous zeropadding (a million zeros after an 8820-sample snippet) have on the FFT's reliability? Would it be an issue that would ruin our results? It certainly makes results look more precise, but would that just mean adding decimals to a guess or would it really be more accurate?

Thanks very much to all of you who replied!

Erkki Nurmi
dieselviulu@yahoo.ca wrote:
> Ron Nicholson wrote:
> > In music, the strongest frequency present might be an overtone of
> > the musical pitch. Autocorrelation can help determine if this is
> > the situation. But if one already knows the approximate pitch, one
> > can measure the dominant frequency and divide that down to get more
> > precise pitch information.
>
> I wonder how often that situation occurs. I'm not sure if you mean the
> situation where a person singing alone (or many people singing the same
> note) produces a sound where an overtone is dominant,
I think a male bass voice can easily produce more overtone than fundamental pitch energy. (Or maybe it's just the cheap microphones I've been using which roll off everything in the low register.) I usually measure the strongest overtone because that's where the S/N ratio is better, then divide down to get the pitch (which might be inferred by ear, or perhaps by rough autocorrelation or cepstral methods).
> Ron Nicholson wrote:
> > As you found, zero-padding and using a long fft, although a very
> > accurate method of interpolating frequency, will not show fine
> > detail in the frequency envelope. However frequency is the
> > derivative of phase. So what I might try is a technique from
> > phase vocoding. Use overlapped successive short fft's and
> > compare the phase changes in the nearest bin of interest with
> > what would be the phase change represented by the overlap
> > offset. Plot that phase difference. The slope of the plot will
> > represent the frequency offset from the fft bin center, and any
> > curvature in the plot will represent a change in frequency.
> >
> > This could work with fft windows as short as maybe a dozen
> > cycles or less of the dominant frequency (which itself may be
> > an overtone of the fundamental pitch), so you can get much
> > better time resolution. For 330 Hz, maybe try 75% overlapped
> > windows as short as maybe 1024 samples of 44.1 KHz.
>
> Wow.. would this also work in my case where all the voices are on the
> same track, or for individual voices only? I have not yet figured out
> what you mean by this but I will, and anyway understanding the answer
> is the asker's responsibility. One key question: What do you mean by
> "nearest bin of interest"? (or rather, what does a "bin" mean in this
> context?)
I'm not sure if any analysis can track an individual voice inside a large choral mix.

An FFT produces output for discrete frequencies (multiples of the sample rate divided by the FFT length). Not sure if it's standard usage, but I call a bin either one of those discrete frequencies or a window of frequencies with width equal to the distance between the discrete frequencies. Most of the energy of a particular frequency will show up in the nearest discrete frequency bin, even if it is not exactly equal to one of the bin center frequencies. So you'd choose that bin for phase analysis across FFT frames.

Given a 50% overlap, any exact even-numbered bin frequency should have the same phase in successive frames, any odd-numbered one the phase inverted, and any deviation from this likely represents a frequency not exactly at the bin center, or a changing pitch (vibrato or glissando, etc.).
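[Editor's note: here is a minimal sketch of that phase-difference trick on a synthetic tone, using the 1024-sample frames and 75% overlap suggested earlier in the thread; the test frequency and variable names are illustrative:]

```python
import numpy as np

fs = 44100
f_true = 334.7                       # deliberately between bin centers
N, hop = 1024, 256                   # 1024-sample frames, 75% overlap
t = np.arange(N + hop) / fs
x = np.sin(2 * np.pi * f_true * t)

win = np.hanning(N)
X0 = np.fft.rfft(win * x[:N])        # two successive overlapped frames
X1 = np.fft.rfft(win * x[hop:hop + N])

k = int(round(f_true * N / fs))      # nearest bin of interest
dphi = np.angle(X1[k]) - np.angle(X0[k])
expected = 2 * np.pi * k * hop / N   # phase advance if f sat exactly on bin k
dev = (dphi - expected + np.pi) % (2 * np.pi) - np.pi   # wrap to [-pi, pi)
f_est = (k + dev * N / (2 * np.pi * hop)) * fs / N
print(f_est)                         # ~334.7, far finer than the ~43 Hz bin width
```

The deviation of the measured phase advance from the bin-center prediction directly gives the frequency offset within the bin, which is why such a short window can still yield a precise frequency.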
> Still one more question: It's usually recommended to zero-pad by
> something like the sample's own length of zeros. What effect does
> outrageous zeropadding (a million zeros after an 8820-sample snippet)
> have on the FFT's reliability? Would it be an issue that would ruin our
> results? It certainly makes results look more precise, but would that
> just mean adding decimals to a guess or would it really be more accurate?
Zero-padding an FFT is almost exactly the same as interpolating very smoothly between the data points resulting from an un-padded FFT. More zeros mostly adds more smoothness. Any ripples theoretically resolvable by zero-padding interpolation are most likely hidden in quantization noise. Others here have posted that a 3 point parabolic curve fit gives good results at a much lower computational cost. Thanks for bringing an interesting question to comp.dsp. IMHO. YMMV. -- rhn A.T nicholson d.0.t C-o-M
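[Editor's note: the three-point parabolic fit mentioned above, applied to the log magnitudes around the peak bin, looks roughly like this on a synthetic tone (sample length and test frequency are illustrative):]

```python
import numpy as np

fs, N = 44100, 8192
f_true = 439.6
t = np.arange(N) / fs
x = np.hanning(N) * np.sin(2 * np.pi * f_true * t)

mag = np.abs(np.fft.rfft(x))
k = int(np.argmax(mag))                      # peak bin, ~5.4 Hz wide
# fit a parabola through the log magnitudes of the peak and its neighbors
a, b, c = np.log(mag[k - 1]), np.log(mag[k]), np.log(mag[k + 1])
delta = 0.5 * (a - c) / (a - 2 * b + c)      # peak offset in bins, |delta| <= 0.5
f_est = (k + delta) * fs / N
print(f_est)                                 # ~439.6 with no zero-padding at all
```

Three log-magnitude samples and one division replace the million-point zero-padded FFT, which is the computational-cost point made above.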
dieselviulu@yahoo.ca wrote:
> Still one more question: It's usually recommended to zero-pad by
> something like the sample's own length of zeros. What effect does
> outrageous zeropadding (a million zeros after an 8820-sample snippet)
> have on the FFT's reliability? Would it be an issue that would ruin our
> results? It certainly makes results look more precise, but would that
> just mean adding decimals to a guess or would it really be more accurate?
I just tried this on some synthetic sinusoids: zero-padded 8192 windowed samples up to an FFT vector about a million samples long, which did give me better results. But the much longer FFT is a lot slower; and the frequency accuracy didn't seem any better than those produced by phase vocoder analysis on successive and much shorter 1024 sample long FFT frames. The latter phase vocoder method has very low cost beyond that of the FFT, other than that of storing phase information from the previous FFT frame of course. Plus the shorter frames allow clearly better time resolution of any frequency changes. But if you have enough compute power that efficiency doesn't matter, your outrageous zero-padding probably won't do anything that would ruin your results (assuming you can get any interesting results from a large chorus recorded with a single mic). IMHO. YMMV. -- rhn A.T nicholson d.0.t C-o-M