Pitch - general questions about accuracy of detection for voice

Started by Paul Thorn December 3, 2009
Hey all,

I'm working on a small project that has a DSP component. Part of what it
does is detect the frequency of a recorded human voice (singing, not speech;
one singer recorded to a single channel).

The code is currently getting the correct note in the range 155-2100Hz
about 80% of the time. It's currently just picking the frequency with the
highest amplitude as the fundamental from each array of partials returned
by an FFT (after phase-based post processing of the data the FFT returns).
It identifies piano notes in the same range correctly 100% of the time.

I have three questions:

1. generally speaking how accurately is it possible to detect notes from
such a voice recording (on a scale of 1-100)? Is it possible to get close
to 100% detection accuracy while maintaining a decent time resolution (say
100ms)?

2. what algorithm would be best for this? (given that performance is a
concern, but not a huge concern)

3. given an FFT approach, how much difference would a peak picker that
scans the distance between upper harmonics to get the fundamental make?

Thanks,

Paul



Paul Thorn wrote:
> Hey all,
>
> I'm working on a small project that has a DSP component. A part of what it
> does is detect the frequency of recorded human voice (singing, not speech.
> One singer recorded to a single channel)
>
> The code is currently getting the correct note in the range 155-2100Hz
> about 80% of the time.

The pitch range of the human voice is ~50...500 Hz.

> It's currently just picking the frequency with the
> highest amplitude as the fundamental from each array of partials returned
> by an FFT (after phase-based post processing of the data the FFT returns).
> It identifies piano notes in the same range correctly 100% of the time.

The component with the highest amplitude may very well not be the pitch
fundamental. You will get much better results if you analyse the spacing
between harmonic peaks.

> I have three questions:
>
> 1. generally speaking how accurately is it possible to detect notes from
> such a voice recording (on a scale of 1-100)? Is it possible to get close
> to 100% detection accuracy while maintaining a decent time resolution (say
> 100ms)?

+100. 100ms is plenty, except in some pathological cases.

> 2. what algorithm would be best for this? (given that performance is a
> concern, but not a huge concern)

I like the normalized autocorrelation approach.

> 3. given an FFT approach, how much difference would a peak picker that
> scans the distance between upper harmonics to get the fundamental make?

That's one of the most accurate methods.

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com
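
For readers who want to try the normalized autocorrelation idea mentioned above, here is a minimal sketch in plain Python. It is not code from anyone in this thread: the function names, the 80-1100 Hz search range, the 0.95 threshold and the synthetic test tones are all assumptions chosen for illustration. Because lags of two or three periods score as well as the true period, the sketch takes the first lag that comes close to the best score and then climbs to the neighbouring local maximum.

```python
import math

def normalized_autocorr_pitch(x, fs, fmin=80.0, fmax=1100.0, rel_threshold=0.95):
    """Estimate f0 (Hz) of one frame from its normalized autocorrelation."""
    lag_min = max(1, int(fs / fmax))
    lag_max = min(int(fs / fmin), len(x) // 2)
    scores = []
    for lag in range(lag_min, lag_max + 1):
        n = len(x) - lag
        a, b = x[:n], x[lag:lag + n]
        num = sum(p * q for p, q in zip(a, b))
        # Normalize by the energy of both segments so the score is at most 1.
        den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
        scores.append(num / den if den > 0.0 else 0.0)
    # Multiples of the period score about as well as the period itself, so
    # take the first lag near the best score, then climb to its local maximum.
    peak = max(scores)
    i = next(k for k, s in enumerate(scores) if s >= rel_threshold * peak)
    while i + 1 < len(scores) and scores[i + 1] > scores[i]:
        i += 1
    return fs / (lag_min + i)

def harmonic_tone(f0, fs, n, amplitudes):
    """Synthetic test frame (assumption): amplitudes[k] scales harmonic k + 1."""
    return [sum(a * math.sin(2.0 * math.pi * (k + 1) * f0 * i / fs)
                for k, a in enumerate(amplitudes))
            for i in range(n)]

fs = 8000
frame = harmonic_tone(200.0, fs, 800, [1.0, 0.5, 0.25])           # 100 ms
no_fundamental = harmonic_tone(200.0, fs, 800, [0.0, 1.0, 1.0, 1.0])
print(normalized_autocorr_pitch(frame, fs))
print(normalized_autocorr_pitch(no_fundamental, fs))
```

The second tone has no energy at 200 Hz at all, yet the waveform still repeats every 1/200 s, which is why a time-domain method can recover the pitch where a strongest-partial picker cannot. Real voice would need at least a voicing decision and sub-sample lag interpolation on top of this.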
On Thu, 03 Dec 2009 08:29:14 -0600, Vladimir Vassilevsky <nospam@nowhere.com>
wrote:

>> It's currently just picking the frequency with the
>> highest amplitude as the fundamental from each array of partials returned
>> by an FFT (after phase-based post processing of the data the FFT returns).
>> It identifies piano notes in the same range correctly 100% of the time.
>
>Component with the highest amplitude very well may not be the pitch
>fundamental. You will get much better results if you analyse the spacing
>between harmonic peaks.

I'll second that. And, in fact, missing harmonics, or even a missing
fundamental, occurs very often. So analyze the spacing of many harmonics
and find the largest common factor.

>> 2. what algorithm would be best for this? (given that performance is a
>> concern, but not a huge concern)

RAPT (Robust Algorithm for Pitch Tracking), David Talkin, in "Speech Coding
and Synthesis", edited by Kleijn and Paliwal. You'll find an implementation
in the Matlab VoiceBox Toolkit, available from
"http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html". Finding a copy
of the original article is extremely difficult, however. Try a search for
"fxrapt".

HPS (Harmonic Product Spectrum), M. R. Schroeder, "Period Histogram and
Product Spectrum: New Methods for Fundamental-Frequency Measurement", The
Journal of the Acoustical Society of America, Volume 43 Number 4, 1968.

--
Greg
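
As a rough illustration of the Harmonic Product Spectrum idea cited above, the sketch below multiplies the magnitude spectrum at bin b by the magnitudes at bins 2b and 3b and picks the bin with the largest product. This is a simplified reading of HPS, not Schroeder's exact formulation. The naive DFT, the function names, the three-harmonic product and the test tone (whose second harmonic is stronger than its fundamental) are assumptions made for the example.

```python
import cmath
import math

def magnitude_spectrum(x, max_bin):
    """Naive DFT magnitudes for bins 0..max_bin (slow, but dependency-free)."""
    n = len(x)
    step = -2j * math.pi / n
    return [abs(sum(s * cmath.exp(step * b * i) for i, s in enumerate(x)))
            for b in range(max_bin + 1)]

def hps_pitch(x, fs, fmin=80.0, fmax=1000.0, harmonics=3):
    """Return the frequency (Hz) whose first few harmonics are jointly strongest."""
    n = len(x)
    bin_hz = fs / n
    lo = max(1, math.ceil(fmin / bin_hz))
    # Keep every harmonic of every candidate below the Nyquist bin.
    hi = min(int(fmax / bin_hz), (n // 2) // harmonics)
    mags = magnitude_spectrum(x, hi * harmonics)
    best = max(range(lo, hi + 1),
               key=lambda b: math.prod(mags[k * b] for k in range(1, harmonics + 1)))
    return best * bin_hz

def harmonic_tone(f0, fs, n, amplitudes):
    """Synthetic test frame (assumption): amplitudes[k] scales harmonic k + 1."""
    return [sum(a * math.sin(2.0 * math.pi * (k + 1) * f0 * i / fs)
                for k, a in enumerate(amplitudes))
            for i in range(n)]

fs = 8000
frame = harmonic_tone(200.0, fs, 800, [0.3, 1.0, 0.6, 0.4])   # 100 ms
mags = magnitude_spectrum(frame, 100)
strongest_hz = max(range(8, 101), key=mags.__getitem__) * fs / len(frame)
hps_hz = hps_pitch(frame, fs)
print(strongest_hz, hps_hz)
```

On this tone the strongest single partial sits at the second harmonic, while the product peaks at the fundamental. Note that a plain product collapses when one of the multiplied harmonics is truly absent, so practical variants usually add a floor or work with log magnitudes.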
Vladimir Vassilevsky wrote:
> Paul Thorn wrote:
>> Hey all,
>>
>> I'm working on a small project that has a DSP component. A part of
>> what it does is detect the frequency of recorded human voice (singing,
>> not speech. One singer recorded to a single channel)
>>
>> The code is currently getting the correct note in the range 155-2100Hz
>> about 80% of the time.
>
> The pitch range of human voice is ~50...500 Hz.

Male voice, more or less. A tenor's "top C" is the ~523Hz one. A coloratura
soprano is expected to reach at least "top F in alt", ~1396Hz, and even
higher**, and young children can squeal somewhat higher given appropriate
provocation/enthusiasm (though I know of no scientifically rigorous study).
But 2100Hz is definitely beyond the range of any (known) adult singer -
approx "Top-C" on the flute. Of course, at those heights most formants are
left well behind, so such notes have no recognisable vowels.

The "singer's formant" (classical Western Art Music Production - WAMP)
around 2kHz can, however, be very prominent in a note, and I can well
imagine it might trigger a pitch detector that just went for the most
prominent partial.

**There is a general if largely unwritten principle that no note a singer
sings in public should be the absolute highest note they can reach - there
must always be a little slack in the system. So any soprano able to hit a
top F at full WAMP must really be able (in private) to reach the G too.

Richard Dobson
Vladimir Vassilevsky wrote:
.....snip.....
> +100. 100ms is plenty unless in some pathological cases.

Not much has been said about resolution, just about % of detection....

I'd worry about 60Hz vs. 55Hz vs 50Hz vs. 53.4Hz with only 100ms.
What's the requirement for resolution and how might this short segment
affect the outcome for lower frequency pitch?

Fred
On Thu, 03 Dec 2009 17:36:53 -0800, Fred Marshall
<fmarshallx@remove_the_xacm.org> wrote:

>Not much has been said about resolution, just about % of detection....
>
>I'd worry about 60Hz vs. 55Hz vs 50Hz vs. 53.4Hz with only 100ms.
>What's the requirement for resolution and how might this short segment
>affect the outcome for lower frequency pitch?

For detecting the fundamental, the resolution limits to which you allude
are significant at 100 ms. But for determining distance between harmonics,
not so much. I have observed many voice samples in which harmonics as high
as #30 were clearly discernible.

Greg
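
One way to see why harmonic spacing eases the resolution problem: if the k-th harmonic's peak is located with an error of d Hz, dividing by k leaves an error of only d/k in the fundamental. The sketch below is one hedged interpretation of the "largest common factor" approach; the function name, the 5 Hz tolerance, the 0.5 Hz search step and the list of measured peaks are all made up for illustration. It searches downward for the highest f0 that explains every peak as an integer multiple, then refines it with a least-squares fit.

```python
def f0_from_peaks(peaks, fmin=80.0, fmax=1100.0, step=0.5, tol_hz=5.0):
    """Highest f0 (Hz) such that every peak lies within tol_hz of a multiple of it."""
    f = fmax
    while f >= fmin:
        # Nearest harmonic number for each peak at this candidate f0.
        ks = [round(p / f) for p in peaks]
        if all(k >= 1 and abs(p - k * f) <= tol_hz for k, p in zip(ks, peaks)):
            # Least-squares refinement: minimize sum((p - k * f0) ** 2) over f0.
            # High harmonics get the most weight, which is where the d/k gain
            # in resolution comes from.
            return sum(k * p for k, p in zip(ks, peaks)) / sum(k * k for k in ks)
        f -= step
    return None

# Hypothetical peak frequencies (Hz) with the 200 Hz fundamental itself missing.
peaks = [398.7, 601.2, 799.6, 1001.5]
estimate = f0_from_peaks(peaks)
print(estimate)
```

Searching from the top down matters: f0/2, f0/3 and so on explain the same peaks equally well, so the first (highest) candidate that fits is the one to keep.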
On Dec 3, 7:48 am, "Paul Thorn" <pthor...@gmail.com> wrote:
> Hey all,
>
> I'm working on a small project that has a DSP component. A part of what it
> does is detect the frequency of recorded human voice (singing, not speech.
> One singer recorded to a single channel)
>
> The code is currently getting the correct note in the range 155-2100Hz
> about 80% of the time. It's currently just picking the frequency with the
> highest amplitude as the fundamental from each array of partials returned
> by an FFT (after phase-based post processing of the data the FFT returns).
> It identifies piano notes in the same range correctly 100% of the time.
>
> I have three questions:
>
> 1. generally speaking how accurately is it possible to detect notes from
> such a voice recording (on a scale of 1-100)? Is it possible to get close
> to 100% detection accuracy while maintaining a decent time resolution (say
> 100ms)?
>
> 2. what algorithm would be best for this? (given that performance is a
> concern, but not a huge concern)
>
> 3. given an FFT approach, how much difference would a peak picker that
> scans the distance between upper harmonics to get the fundamental make?
>
> Thanks,
>
> Paul

First, do yourself a BIG favor and forget about picking FFT partials.

For human singing the frequency range would be approx 80-1100 Hz, about a
13 times difference. And, unlike piano, the human voice has those things
called "formants", and to make matters worse, its fundamental frequency F0
(it's more correct to talk about its inverse, the fundamental period T0,
instead) is not constant but changes all the time (things like vibrato
etc.).

As far as algorithms go... Autocorrelation will do, AMDF will do... to some
extent and with a lot of tweaking.

However, those methods are obsolete: the modern state of the art is
described in US Patent 7124075
( http://www.google.com/patents/about?id=dB97AAAAEBAJ&dq=7124075 )

Your real challenge will be in designing an algorithm which can adjust its
analysis window dynamically as the fundamental frequency changes all the
time: ideally you need your window to cover at least 2 complete periods
with autocorrelation or AMDF (or even less than 2 complete periods with
7124075 for a lot better time resolution), but not more than 3-4 periods,
otherwise your time resolution will be lost.

And YES, you can get very close to 100% (if your singer is not esophageal).

This is as far as free advice goes.

Do not forget to tell us if your little project is commercially
successful :-)
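
To make the window-sizing advice concrete, here is a small sketch of AMDF (average magnitude difference function) run first on a 100 ms frame and then on a frame trimmed to about three periods of that first estimate. It is only an illustration under stated assumptions: the function names, the 0.1 threshold, the three-period choice and the synthetic tone are invented for the example, and it does not implement the patented method mentioned above.

```python
import math

def amdf_pitch(x, fs, fmin=80.0, fmax=1100.0, rel_threshold=0.1):
    """Estimate f0 (Hz) from the first deep minimum of the AMDF."""
    lag_min = max(1, int(fs / fmax))
    # Never compare over less than half the frame, so the frame has to
    # hold at least two periods of the pitch being measured.
    lag_max = min(int(fs / fmin), len(x) // 2)
    d = []
    for lag in range(lag_min, lag_max + 1):
        n = len(x) - lag
        d.append(sum(abs(x[i] - x[i + lag]) for i in range(n)) / n)
    # Minima repeat at every multiple of the period, so take the first lag
    # that dips near the global minimum, then descend to its local minimum.
    lo, hi = min(d), max(d)
    i = next(k for k, v in enumerate(d) if v <= lo + rel_threshold * (hi - lo))
    while i + 1 < len(d) and d[i + 1] < d[i]:
        i += 1
    return fs / (lag_min + i)

def frame_length(f0, fs, periods=3):
    """Samples needed to cover the given number of periods of f0."""
    return int(round(periods * fs / f0))

fs = 8000
tone = [math.sin(2 * math.pi * 200.0 * i / fs)
        + 0.5 * math.sin(2 * math.pi * 400.0 * i / fs)
        + 0.25 * math.sin(2 * math.pi * 600.0 * i / fs)
        for i in range(800)]                       # 100 ms of a 200 Hz tone

coarse = amdf_pitch(tone, fs)                      # long frame, first guess
short_frame = tone[:frame_length(coarse, fs)]      # about three periods
refined = amdf_pitch(short_frame, fs)
print(coarse, len(short_frame), refined)
```

In a tracker, the frame length for each new frame would be derived from the previous frame's estimate, so the window shrinks as the pitch rises and time resolution improves accordingly.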

fatalist wrote:

> However, those methods are obsolete: the modern state of the art is
> described in US Patent 7124075
> ( http://www.google.com/patents/about?id=dB97AAAAEBAJ&dq=7124075 )

Wow, we forgot of the ABSOLUTELY THE BEST ULTIMATE PITCH DETECTOR of
Dmitry Teres.

Nobody knows what it is.
Nobody knows how it compares to other detectors.
Nobody uses it (except the author, may be).

Yet it is THE BEST ULTIMATE PITCH DETECTOR, AND YOU ABSOLUTELY HAVE TO
USE IT.

> Do not forget to tell us if your little project is commercially
> successfull :-)

VLV
On Dec 4, 10:51 am, Vladimir Vassilevsky <nos...@nowhere.com> wrote:
> fatalist wrote:
> > However, those methods are obsolete: the modern state of the art is
> > described in US Patent 7124075
> > (http://www.google.com/patents/about?id=dB97AAAAEBAJ&dq=7124075)
>
> Wow, we forgot of the ABSOLUTELY THE BEST ULTIMATE PITCH DETECTOR of
> Dmitry Teres.
>
> Nobody knows what it is.
> Nobody knows how it compares to other detectors.
> Nobody uses it (except the author, may be).
>
> Yet it is THE BEST ULTIMATE PITCH DETECTOR, AND YOU ABSOLUTELY HAVE TO
> USE IT.
>
> > Do not forget to tell us if your little project is commercially
> > successfull :-)
>
> VLV

I guess some folks actually used it (if not commercially) and endorsed it:

http://www.springerlink.com/content/g05x815817536777/

Haven't seen your contribution there :)

You need to work on your spelling, dude.

I suggest doing Google look-ups before misspelling proper names.