DSPRelated.com Forums

Pitch Detection (various methods)

Started by zotty December 31, 2008
Hi All,

I've implemented a pitch detector using a pretty brute-force method:
PCM audio -> FFT(512) -> DCT(512), then do auto-correlation using the
current data and a history buffer. It's working great. The application
assumes singing on a potentially noisy channel (headset mic or open
mic configurations are possible, e.g. karaoke). I do some signal
normalization in the cepstral domain.

This seems like overkill, but time-domain stuff, a la zero crossing or
something, makes some huge assumptions regarding signal to noise, and
even DC offset. My assumption is that this technique will usually be more
robust against different signal-to-noise characteristics. I'd like to
try out some different algorithms, perhaps time domain. What else
would you try?

Any thoughts?

Regards,
MarkZ
Annosoft
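
For reference, the FFT -> DCT -> peak-pick pipeline described above is in the
family of real-cepstrum pitch estimates: transform the log-magnitude spectrum
back and look for a peak at the pitch period (quefrency). A minimal sketch of
that idea, not zotty's exact method; a naive O(N^2) DFT is used so the code
stays self-contained (a real implementation would use an FFT), and the lag
bounds are illustrative:

#include <math.h>

#define N 512

/* one common cepstral pitch estimate: the inverse transform of the
   log-magnitude spectrum peaks at the pitch period (quefrency).
   min_lag/max_lag bound the expected vocal range in samples. */
int cepstral_pitch_period(const float x[N], int min_lag, int max_lag)
{
    const double PI = 3.14159265358979323846;
    double logmag[N];

    /* log-magnitude spectrum; naive O(N^2) DFT for self-containment */
    for (int k = 0; k < N; k++) {
        double re = 0.0, im = 0.0;
        for (int n = 0; n < N; n++) {
            double w = 2.0 * PI * k * n / N;
            re += x[n] * cos(w);
            im -= x[n] * sin(w);
        }
        logmag[k] = log(sqrt(re * re + im * im) + 1e-12);
    }

    /* real cepstrum: logmag is real and even for real input, so the
       inverse transform reduces to a cosine sum; pick the peak */
    int best = min_lag;
    double best_val = -1e300;
    for (int q = min_lag; q <= max_lag; q++) {
        double c = 0.0;
        for (int k = 0; k < N; k++)
            c += logmag[k] * cos(2.0 * PI * k * q / N);
        if (c > best_val) { best_val = c; best = q; }
    }
    return best;    /* pitch in Hz = sample_rate / best */
}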
On Dec 31, 12:14 pm, zotty <mzart...@annosoft.com> wrote:
> ... time-domain stuff, a la zero crossing or something, makes some
> huge assumptions regarding signal to noise, and even DC offset. My
> assumption is that this technique will usually be more robust against
> different signal-to-noise characteristics. I'd like to try out some
> different algorithms, perhaps time domain. What else would you try?
>
> Any thoughts?
look up "Average Magnitude Difference Function" (AMDF) or the similar
average squared difference function (sometimes called "ASDF") or
autocorrelation. these methods make no assumptions regarding the signal
other than that there is some degree of periodicity. they are mostly
equivalent to each other in theory (the ASDF hits a minimum exactly where
the autocorrelation hits a max). i tried to make a simple and formal
description of ASDF in my old Wavetable Synthesis 101 paper (it's
somewhere at musicdsp.org). so even though there are no zero-crossing
issues, there are threshold issues that need to be worked out to avoid
the "octave problem".

even though there are no assumptions about the signal and noise, if you
are willing to make a few, you might consider pre-filtering the signal
going to the correlation operation. sometimes DC-blocking and LPFing can
be helpful.

r b-j
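
The "mostly equivalent in theory" point can be written out in one line
(stated here for reference; the approximation assumes the windowed energy
changes little over a lag of \tau samples):

    \mathrm{ASDF}(\tau) = \frac{1}{L} \sum_{n=0}^{L-1} \big( x[n] - x[n-\tau] \big)^2
                        \approx 2 \big( R(0) - R(\tau) \big),
    \qquad R(\tau) = \frac{1}{L} \sum_{n=0}^{L-1} x[n] \, x[n-\tau]

so a minimum of the ASDF over \tau is (approximately) a maximum of the
autocorrelation R(\tau).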

zotty wrote:
> Hi All,
>
> I've implemented a pitch detector
For what purpose?
> using a pretty brute-force method:
> PCM audio -> FFT(512) -> DCT(512), then do auto-correlation using the
> current data and a history buffer.
Why smoke and mirrors instead of the trivial method of the normalized autocorrelation?
> It's working great. The application
> assumes singing on a potentially noisy channel (headset mic or open
> mic configurations are possible, e.g. karaoke). I do some signal
> normalization in the cepstral domain.
>
> This seems like overkill, but time-domain stuff, a la zero crossing or
> something, makes some huge assumptions regarding signal to noise, and
> even DC offset. My assumption is that this technique will usually be
> more robust against different signal-to-noise characteristics. I'd like
> to try out some different algorithms, perhaps time domain. What else
> would you try?
There used to be Dmitry Terez here, who claimed that he invented the
ultimate absolute revolutionary top-secret pitch detection algorithm,
superior to anything else. I wonder what happened to him.

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com
On Dec 31, 1:53 pm, robert bristow-johnson <r...@audioimagination.com> wrote:
> look up "Average Magnitude Difference Function" (AMDF) or the similar
> average squared difference function (sometimes called "ASDF") or
> autocorrelation. [...]
>
> r b-j
Hey r,

Thank you for the tips and expertise.

I immediately turned to papers covering AMDF. I often have trouble
translating these papers into working logic (sigh); it might just be my
wetware limitations. I'd like to take a crack at understanding, and I'm
close:

    AMDF(t) = 1/L * SUM(i = 1 to L) of ABS(s(i) - s(i - t))

The AMDF of t is the sum of the magnitude differences of a
forward-looking buffer of size L against a delay line of size L. In code
terms, this looks like a single "for" loop, with a simple subtraction of
the current buffer against a delay line.

This looks wrong, though; if that were the case, s(i - t) would be
s(i - L) instead. I'm sorry for the clod-headedness here, but does "t"
introduce another loop?

-- snippet without boundary checks -- given an infinite "audio_buffer"
and an integer "current_sample", compute the AMDF at the current sample:

float temp = 0.0f;
for (i = 0; i < L; i++)
{
    temp += fabsf(audio_buffer[current_sample + i]
                - audio_buffer[current_sample - L + i]);
}
AMDF[current_sample] = temp / L;

Is that correct, or is there actually another loop for "t" where the
current sample is differenced against everything, another "tap" in there?

Thanks for your help.
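
As usually defined, the AMDF is indeed a function of the lag t, so a pitch
search does involve a second, outer loop over candidate lags, with the
period estimate taken at the minimum. A minimal sketch under that standard
definition (the window length L and the lag bounds are illustrative; x must
be valid for indices n0 - max_lag .. n0 + L - 1):

#include <math.h>

/* AMDF evaluated over a range of candidate lags; the lag with the
   smallest average absolute difference is the period estimate */
int amdf_period(const float *x, int n0, int L, int min_lag, int max_lag)
{
    int best_lag = min_lag;
    float best_val = 1e30f;
    for (int t = min_lag; t <= max_lag; t++) {   /* outer loop over lags */
        float acc = 0.0f;
        for (int i = 0; i < L; i++)              /* inner loop: one AMDF value */
            acc += fabsf(x[n0 + i] - x[n0 + i - t]);
        acc /= (float)L;
        if (acc < best_val) { best_val = acc; best_lag = t; }
    }
    return best_lag;    /* pitch in Hz = sample_rate / best_lag */
}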
On Dec 31, 4:21 pm, zotty <mzart...@annosoft.com> wrote:
(quoted text snipped)
Is the pitch acquisition time important to you? If it's a guitar
synthesizer application then this is usually a big issue.

Bob Adams
On Dec 31, 9:14 am, zotty <mzart...@annosoft.com> wrote:
> [...] I'd like to
> try out some different algorithms, perhaps time domain. What else
> would you try?
>
> Any thoughts?
Depends on your criteria. Latency? Frequency accuracy? Tracking FM or
vibrato rate? Pitch duration or transition time measurement?

I've got a list of frequency and pitch estimation methods that I've
looked at or played with here:
  http://www.nicholson.com/rhn/dsp.html

My current random opinion is that the human ear uses something which
produces results at least slightly similar to FFT interpolated peak
estimation for high-frequency pitches, Harmonic Product Spectrum (sort of
a poor man's Cepstrum) for middling-frequency pitches, and AMDF for very
low pitches.

-- rhn A.T nicholson d.0.t C-o-M
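
A quick illustration of the Harmonic Product Spectrum idea mentioned above:
multiply the magnitude spectrum by downsampled copies of itself so that the
harmonics reinforce at the fundamental's bin. A minimal sketch, assuming a
precomputed FFT magnitude array mag[]; the harmonic count is illustrative:

/* Harmonic Product Spectrum: for each candidate fundamental bin k,
   multiply the magnitudes at its integer harmonics k, 2k, 3k, ...;
   the product peaks near the true fundamental. mag[] is a precomputed
   FFT magnitude spectrum of length n. */
int hps_peak_bin(const float *mag, int n, int num_harmonics)
{
    int best_bin = 1;
    float best_val = 0.0f;
    for (int k = 1; k < n / num_harmonics; k++) {
        float prod = 1.0f;
        for (int h = 1; h <= num_harmonics; h++)
            prod *= mag[k * h];          /* sample the h-th downsampled copy */
        if (prod > best_val) { best_val = prod; best_bin = k; }
    }
    return best_bin;    /* f0 estimate = best_bin * sample_rate / fft_size */
}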
Hey Ron,
Thank you.

> Depends on your criteria. Latency? Frequency accuracy?
> Tracking FM or vibrato rate? Pitch duration or transition
> time measurement?
I've got two usages, one concerning speech, the other singing. We license
an SDK for doing automatic lip sync (mouth positions) given an audio
file/audio stream. Pitch information on voiced phonemes is a useful cue
for gesturing. In this case, I'm already doing a ton of processing (the
magnitude spectrum is available for free), so cepstrum work isn't very
costly and latency isn't an issue.

The second usage is pitch estimation from live vocals during game play
(like Guitar Hero or Rock Band). Semitone frequency accuracy is the goal.
I expect that I'll want to try to smooth out vibrato changes in the
estimation, but it's probably not a huge deal. Cepstrum may still be
available depending on whether the realtime phoneme extractor is also
used to aid in the score. I think 80-100 milliseconds of latency will be
acceptable.

The problem, when really broken down, is not arbitrary pitch detection,
but rather comparison of an audio signal with a given MIDI realization of
the vocal track. It's likely to be a very noisy environment, so it's
probably going to be more reliable to look for signal energy in bands
near the expected semitone and score based on that.

Mark Zartler
Annosoft
On Dec 31 2008, 7:59 pm, Robert Adams <robert.ad...@analog.com> wrote:
> Is the pitch acquisition time important to you? If it's a guitar
> synthesizer application then this is usually a big issue.
... On Jan 2, 12:09 pm, zotty <mzart...@annosoft.com> wrote:
> Hey Ron,
> Thank you.
>
> > Depends on your criteria. Latency? Frequency accuracy?
> > Tracking FM or vibrato rate? Pitch duration or transition
> > time measurement?
>
> I've got two usages, one concerning speech, the other singing. We
> license an SDK for doing automatic lip sync (mouth positions) given an
> audio file/audio stream. Pitch information on voiced phonemes is a
> useful cue for gesturing. In this case, I'm already doing a ton of
> processing (the magnitude spectrum is available for free), so cepstrum
> work isn't very costly and latency isn't an issue.
if delay is no problem, and you can afford to compute cepstrums, i think
the autocorrelation or squared-difference methods (which can be
expensive) as a starting point is your best bet. the creative part is
looking at the autocorrelation or ASDF result and inferring the correct
fundamental frequency out of that. there are lotsa subtle issues to worry
about. most of these issues i won't talk about, but one common issue is
the so-called "octave error" problem.

consider a perfectly periodic tone with fundamental at 440 Hz. you would
listen to it and say it sounds like A4 (or MIDI 69). but, mathematically,
that tone is also a 220 Hz waveform (that happens to have all of its odd
harmonics with zero amplitude). so whatever measure of periodicity that
says the note is very periodic at 440 Hz will also measure the
periodicity (of that very same note) at 220 Hz to be just as high. how do
you prefer one estimate over the other? if you say that it is the highest
possible fundamental frequency that results in a high degree of
periodicity, then you begin to introduce a threshold to determine which
candidate estimates are counted.

on top of that, you can fool it with the appearance of low-level
sub-harmonics. suppose your note is really an A440, but somehow it has a
teeny bit of A220 (with some odd harmonic energy), attenuated by 80 dB,
added to it. mathematically, it's a 220 Hz waveform (and you output MIDI
57) and not a 440 Hz waveform, but somehow it really sounds like 440 and
somehow your pitch detector has to make the same judgment. if it's a
simple threshold, then when some waveform approaches that threshold (and,
in real life these waveforms come at you at inconvenient times) you hear
the tracking pitch jump back and forth between what is likely the correct
pitch and an octave (either up or down) off.

have fun with it.
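
One common heuristic for the octave problem described above (an
illustrative sketch, not a prescription from this thread): normalize the
autocorrelation by its zero-lag value and take the smallest lag whose peak
clears a fixed threshold, favoring the highest plausible fundamental, and
fall back to the global peak if none qualifies:

/* pick the smallest lag whose normalized autocorrelation exceeds a
   threshold (e.g. 0.85); x must be valid for indices 0 .. L + max_lag - 1 */
int acorr_period(const float *x, int L, int min_lag, int max_lag, float thresh)
{
    float r0 = 1e-12f;                       /* zero-lag energy (guarded) */
    for (int i = 0; i < L; i++)
        r0 += x[i] * x[i];

    int best_lag = min_lag;
    float best_val = -1.0f;
    for (int t = min_lag; t <= max_lag; t++) {
        float r = 0.0f;
        for (int i = 0; i < L; i++)
            r += x[i] * x[i + t];
        r /= r0;                             /* roughly in [-1, 1] */
        if (r > thresh)
            return t;                        /* first qualifying = smallest lag */
        if (r > best_val) { best_val = r; best_lag = t; }
    }
    return best_lag;                         /* fall back to the global peak */
}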
> Second usage is for pitch estimation from live vocals during game play
then latency is an issue, no?
> (like Guitar Hero or Rock Band). Semitone frequency accuracy is the
> goal. I expect that I'll want to try to smooth out vibrato changes in
> the estimation, but it's probably not a huge deal.
well, you will find that nearly all human vocal pitch contours have lotsa variation in it and seldom lands right on (or really close to) the dead center of the semitone pitches. unless they're using a commercial pitch processor and switching it to the "Cher effect" (where the processed vocal has the pitch quantized tightly to the semitone pitches or some other preprogrammed list of notes) which is now really overused in pop music.
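
For scoring a pitch estimate against semitone targets, the standard
frequency-to-MIDI mapping (stated here for reference, not something from
the thread) is note = 69 + 12*log2(f/440); a one-liner:

#include <math.h>

/* nearest MIDI note number for a frequency estimate in Hz (69 = A440) */
int nearest_midi_note(float f_hz)
{
    return (int)lroundf(69.0f + 12.0f * log2f(f_hz / 440.0f));
}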
> Cepstrum may still be
> available depending on whether the realtime phoneme extractor is also
> used to aid in the score. I think 80-100 milliseconds of latency will
> be acceptable.
you can do a lot in 100 ms. be grateful that you have that much time to play with. my life would be easier if i had that much time.
> The problem, when really broken down, is not arbitrary pitch
> detection, but rather comparison of an audio signal with a given MIDI
> realization of the vocal track.
i dunno anything about MIDI files, but i know MIDI 1.0 pretty well. how do you represent precise pitches (between dead-center semitones) in a MIDI realization? i know it's just a protocol or file format issue (outside of MIDI 1.0), but i don't know how they define it.
> It's likely to be a very noisy
> environment, so it's probably going to be more reliable to look for
> signal energy in bands near the expected semitone and score based on
> that.
well, a filter that is tuned to the integer harmonics of a common
fundamental frequency that we'll call "the expected semitone" is a comb
filter. if you consider a pitch detector that has a bank of various comb
filters tuned to all of the candidate pitches, and you measure the output
power of each comb filter by squaring the output and LPFing it, that is
essentially the ASDF. if you absolute-valued the outputs of the comb
filters before LPFing, then it would be the more familiarly-titled AMDF.

r b-j
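
That equivalence can be written out directly: with a feedforward comb
y[n] = x[n] - x[n-\tau] tuned to candidate lag \tau, averaging the squared
output gives exactly the ASDF at that lag, and averaging the absolute
output gives the AMDF:

    \frac{1}{L} \sum_{n=0}^{L-1} y[n]^2 = \mathrm{ASDF}(\tau),
    \qquad \frac{1}{L} \sum_{n=0}^{L-1} |y[n]| = \mathrm{AMDF}(\tau)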
On Wed, 31 Dec 2008 22:53:34 -0800 (PST), "Ron N."
<rhnlogic@yahoo.com> wrote:

   (snipped by Lyons)
> [...]
> I've got a list of frequency and pitch estimation methods that I've
> looked at or played with here:
>   http://www.nicholson.com/rhn/dsp.html
> [...]
Hi Ron,

I took a look at your above web page. I noticed at the bottom of the
page, under the "Other Online DSP Resources" category, you had the
following line:

   a list of Online DSP books - from R. Lyons' article in IEEE Signal
   Processing.

The URL you have there requires a visitor to be a "paid" member of the
IEEE in order to see the "list of books". Ron, as it turns out I have
that same list of books, and a secondary list, at the following
DspRelated.com web site:

   http://www.dsprelated.com/blogs-1/nf/Rick_Lyons.php

I mention this to you because the two lists of books on the
DspRelated.com web site are totally free of charge, and available to
anyone.

Regards,
[-Rick-]