comp.dsp | Pitch - general questions about accuracy of detection for voice

Hey all,

I'm working on a small project that has a DSP component. Part of what it
does is detect the frequency of a recorded human voice (singing, not speech;
one singer recorded to a single channel).

The code is currently getting the correct note in the range 155-2100Hz
about 80% of the time. It's currently just picking the frequency with the
highest amplitude as the fundamental from each array of partials returned
by an FFT (after phase-based post processing of the data the FFT returns).
It identifies piano notes in the same range correctly 100% of the time.

I have three questions:

1. generally speaking how accurately is it possible to detect notes from
such a voice recording (on a scale of 1-100)? Is it possible to get close
to 100% detection accuracy while maintaining a decent time resolution (say
100ms)?

2. what algorithm would be best for this? (given that performance is a
concern, but not a huge concern)

3. given an FFT approach, how much difference would a peak picker that
scans the distance between upper harmonics to get the fundamental make?

Thanks,

Paul

Reply by Vladimir Vassilevsky ● December 3, 2009


Paul Thorn wrote:
> Hey all,
> 
> I'm working on a small project that has a DSP component. A part of what it
> does is detect the frequency of recorded human voice (singing, not speech.
> One singer recorded to a single channel)
> 
> The code is currently getting the correct note in the range 155-2100Hz
> about 80% of the time.

The pitch range of human voice is ~50...500 Hz.

> It's currently just picking the frequency with the
> highest amplitude as the fundamental from each array of partials returned
> by an FFT (after phase-based post processing of the data the FFT returns).
> It identifies piano notes in the same range correctly 100% of the time.

The component with the highest amplitude may very well not be the pitch
fundamental. You will get much better results if you analyse the spacing
between harmonic peaks.

> I have three questions:
> 
> 1. generally speaking how accurately is it possible to detect notes from
> such a voice recording (on a scale of 1-100)?  Is it possible to get close
> to 100% detection accuracy while maintaining a decent time resolution (say
> 100ms)?

+100. 100ms is plenty, except in some pathological cases.

> 2. what algorithm would be best for this? (given that performance is a
> concern, but not a huge concern)

I like the normalized autocorrelation approach.
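
For readers who want something concrete, here is a minimal sketch of a normalized autocorrelation pitch estimator for a single frame. It is not code from this thread: the function name nacf_pitch, the 80-1100 Hz search range, and the 0.9 relative threshold are assumptions chosen for illustration. Lags at two or three times the true period correlate about as well as the period itself, so the sketch takes the first lag that comes close to the best score rather than the global maximum.

```python
import numpy as np

def nacf_pitch(frame, fs, fmin=80.0, fmax=1100.0, rel_threshold=0.9):
    """Return (f0_hz, score) for one frame using normalized autocorrelation.

    fmin, fmax and rel_threshold are illustrative defaults, not tuned values.
    """
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()                           # remove DC before correlating
    lag_min = max(1, int(fs / fmax))           # shortest period searched
    lag_max = min(int(fs / fmin), len(x) - 2)  # longest period searched
    if lag_max < lag_min:
        raise ValueError("frame too short for the requested pitch range")

    scores = np.zeros(lag_max - lag_min + 1)
    for i, lag in enumerate(range(lag_min, lag_max + 1)):
        a, b = x[:-lag], x[lag:]
        denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
        scores[i] = np.dot(a, b) / denom if denom > 0.0 else 0.0

    # First lag that scores close to the best one, then climb to the
    # local maximum. This avoids reporting a multiple of the true period.
    i = int(np.argmax(scores >= rel_threshold * scores.max()))
    while i + 1 < len(scores) and scores[i + 1] > scores[i]:
        i += 1
    return fs / (lag_min + i), float(scores[i])

# Synthetic test tone: 200 Hz with a second harmonic stronger than the
# fundamental, the case where a "largest FFT peak" picker goes wrong.
fs = 8000
t = np.arange(fs // 10) / fs                   # one 100 ms frame
voice_like = (0.3 * np.sin(2 * np.pi * 200 * t)
              + 1.0 * np.sin(2 * np.pi * 400 * t)
              + 0.5 * np.sin(2 * np.pi * 600 * t))
f0, score = nacf_pitch(voice_like, fs)
```

The returned score can double as a crude voicing measure: a low value suggests the frame has no clear period and the frequency should be ignored.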

> 3. given an FFT approach, how much difference would a peak picker that
> scans the distance between upper harmonics to get the fundamental make?

That's one of the most accurate methods.


Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com

Reply by Greg Berchin ● December 3, 2009

On Thu, 03 Dec 2009 08:29:14 -0600, Vladimir Vassilevsky <nospam@nowhere.com>
wrote:

>> It's currently just picking the frequency with the
>> highest amplitude as the fundamental from each array of partials returned
>> by an FFT (after phase-based post processing of the data the FFT returns).
>> It identifies piano notes in the same range correctly 100% of the time.
>
>Component with the highest amplitude very well may not be the pitch 
>fundamental. You will get much better results if you analyse the spacing 
>between harmonic peaks.

I'll second that.  And, in fact, missing harmonics, or even a missing
fundamental, occur very often.  So analyze the spacing of many harmonics and
find the largest common factor.
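
One way to read "largest common factor" for measured (and therefore inexact) peak frequencies is sketched below. This is an illustration under assumptions, not the method anyone in this thread uses: the name common_factor_f0, the 0.5 Hz scan step, and the tolerance of 0.05 harmonic numbers are all made up for the example. It scans candidate fundamentals from high to low, accepts the first one for which every peak sits near an integer multiple, and then refines it by least squares.

```python
import numpy as np

def common_factor_f0(peaks_hz, fmin=80.0, fmax=1100.0, tol=0.05, step=0.5):
    """Largest f0 in [fmin, fmax] that explains every peak as a harmonic.

    tol is the allowed deviation in units of harmonic number (assumed value).
    Returns None when no candidate fits.
    """
    peaks = np.asarray(peaks_hz, dtype=float)
    for cand in np.arange(fmax, fmin - step / 2, -step):
        ratio = peaks / cand
        n = np.round(ratio)                    # nearest harmonic numbers
        if np.all(n >= 1) and np.all(np.abs(ratio - n) <= tol):
            # Least-squares fit of peaks ~= n * f0 for the chosen n.
            return float(np.dot(n, peaks) / np.dot(n, n))
    return None

# Harmonics 2, 3 and 5 of 220 Hz: the fundamental and the 4th are missing.
f0_clean = common_factor_f0([440.0, 660.0, 1100.0])
# The same peaks with small measurement errors.
f0_noisy = common_factor_f0([441.0, 659.0, 1102.0])
```

Scanning downward matters: any submultiple of the true fundamental (110 Hz, 73.3 Hz, ...) also fits the peaks, so the largest fitting candidate is the one to keep.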

>> 2. what algorithm would be best for this? (given that performance is a
>> concern, but not a huge concern)

RAPT (Robust Algorithm for Pitch Tracking), David Talkin, in "Speech
Coding and Synthesis", edited by Kleijn and Paliwal.  You'll find an
implementation in the Matlab VoiceBox Toolkit, available from
"http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html".  Finding a copy
of the original article is extremely difficult, however.  Try a search for
"fxrapt".

HPS (Harmonic Product Spectrum),  M. R. Schroeder, "Period Histogram
and Product Spectrum:  New Methods for Fundamental-Frequency
Measurement", The Journal of the Acoustical Society of America, Volume
43 Number 4, 1968.
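
As a rough illustration of the Harmonic Product Spectrum idea (not a reproduction of Schroeder's paper), the sketch below multiplies the magnitude spectrum by copies of itself downsampled by 2, 3, and so on, so that the harmonics all pile up at the fundamental. The function name hps_pitch, the Hann window, the 8x zero-padding, and the choice of four harmonics are assumptions made for the example.

```python
import numpy as np

def hps_pitch(frame, fs, n_harmonics=4, fmin=80.0, fmax=1100.0, nfft=None):
    """Estimate f0 (Hz) of one frame with a harmonic product spectrum."""
    x = np.asarray(frame, dtype=float)
    x = x * np.hanning(len(x))                 # reduce spectral leakage
    nfft = nfft or 8 * len(x)                  # zero-pad for a finer grid
    spec = np.abs(np.fft.rfft(x, nfft))

    n = len(spec) // n_harmonics               # bins that have all harmonics
    hps = spec[:n].copy()
    for h in range(2, n_harmonics + 1):
        hps *= spec[::h][:n]                   # spectrum compressed by h

    freqs = np.arange(n) * fs / nfft
    band = (freqs >= fmin) & (freqs <= fmax)   # only search plausible pitches
    return float(freqs[np.argmax(np.where(band, hps, 0.0))])

# 220 Hz tone whose fundamental is much weaker than its 2nd harmonic.
fs = 8000
t = np.arange(fs // 10) / fs                   # 100 ms
x = (0.2 * np.sin(2 * np.pi * 220 * t)
     + 1.0 * np.sin(2 * np.pi * 440 * t)
     + 0.8 * np.sin(2 * np.pi * 660 * t)
     + 0.6 * np.sin(2 * np.pi * 880 * t))
f0_hps = hps_pitch(x, fs)
```

Note that with n_harmonics=4 the search only reaches up to one quarter of the Nyquist frequency, so the sample rate has to be chosen with the highest expected pitch in mind.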

-- Greg

Reply by Richard Dobson ● December 3, 2009

Vladimir Vassilevsky wrote:
> 
> 
> Paul Thorn wrote:
>> Hey all,
>>
>> I'm working on a small project that has a DSP component. A part of 
>> what it
>> does is detect the frequency of recorded human voice (singing, not 
>> speech.
>> One singer recorded to a single channel)
>>
>> The code is currently getting the correct note in the range 155-2100Hz
>> about 80% of the time.
> 
> The pitch range of human voice is ~50...500 Hz.
> 

Male voice, more or less. A tenor's "top C" is the ~523Hz one. A
coloratura soprano is expected to reach at least "top F in alt",
~1396Hz, and even higher**, and young children can squeal somewhat
higher given appropriate provocation/enthusiasm (though I know of no
scientifically rigorous study). But 2100Hz is definitely beyond the
range of any (known) adult singer - approx "Top-C" on the flute. Of
course, at those heights most formants are left well behind, so such
notes have no recognisable vowels. The "singer's formant" (classical
Western Art Music Production - WAMP) around 2kHz can, however, be very
prominent in a note, and I can well imagine it might trigger a
pitch detector that just went for the most prominent partial.
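
To relate the frequencies above to note names, here is a small sketch that maps a frequency to the nearest equal-tempered note. It assumes A4 = 440 Hz and scientific pitch notation (MIDI note 60 = C4); the function name nearest_note is made up for the example.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def nearest_note(freq_hz, a4=440.0):
    """Return (note_name, cents_off) for the nearest equal-tempered note."""
    midi = round(69 + 12 * math.log2(freq_hz / a4))    # MIDI 69 = A4
    name = NOTE_NAMES[midi % 12] + str(midi // 12 - 1)
    note_hz = a4 * 2 ** ((midi - 69) / 12)
    cents = 1200 * math.log2(freq_hz / note_hz)        # + means sharp
    return name, cents

tenor_top_c = nearest_note(523.0)      # name is "C5"
soprano_top_f = nearest_note(1396.0)   # name is "F6"
upper_limit = nearest_note(2100.0)     # name is "C7", a few cents sharp
```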

**There is a general if largely unwritten principle that no note a 
singer sings in public should be the absolute highest note they can 
reach - there must always be a little slack in the system. So any 
soprano able to hit a top F at full WAMP must really be able (in 
private) to reach the G too.

Richard Dobson

Reply by Fred Marshall ● December 3, 2009

Vladimir Vassilevsky wrote:
.....snip.....
> 
> +100. 100ms is plenty unless in some pathological cases.
> 

Not much has been said about resolution, just about % of detection....

I'd worry about 60Hz vs. 55Hz vs 50Hz vs. 53.4Hz with only 100ms.
What's the requirement for resolution and how might this short segment 
affect the outcome for lower frequency pitch?

Fred

Reply by Greg Berchin ● December 3, 2009

On Thu, 03 Dec 2009 17:36:53 -0800, Fred Marshall
<fmarshallx@remove_the_xacm.org> wrote:

>Not much has been said about resolution, just about % of detection....
>
>I'd worry about 60Hz vs. 55Hz vs 50Hz vs. 53.4Hz with only 100ms.
>What's the requirement for resolution and how might this short segment 
>affect the outcome for lower frequency pitch?

For detecting the fundamental, the resolution limits to which you allude are
significant at 100 ms.  But for determining distance between harmonics, not so
much.  I have observed many voice samples in which harmonics as high as #30 were
clearly discernible.  
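
A back-of-the-envelope version of that argument, as a sketch: take the raw FFT bin spacing 1/T as a pessimistic bound on how well any single peak can be located (interpolation usually does better, so this is an assumption, not a hard limit). Locating harmonic number n to within one bin and dividing by n shrinks the error in the fundamental by the same factor. The function names are hypothetical.

```python
def bin_spacing_hz(window_s):
    """FFT bin spacing for a window of the given length, without zero-padding."""
    return 1.0 / window_s

def f0_error_hz(window_s, harmonic):
    """f0 error if the given harmonic is located to within one bin."""
    return bin_spacing_hz(window_s) / harmonic

spacing = bin_spacing_hz(0.1)            # 100 ms window: 10 Hz between bins
err_fundamental = f0_error_hz(0.1, 1)    # 10 Hz, cannot separate 50 from 53.4 Hz
err_harmonic_30 = f0_error_hz(0.1, 30)   # about 0.33 Hz from harmonic #30
```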

Greg

Reply by fatalist ● December 4, 2009

On Dec 3, 7:48 am, "Paul Thorn" <pthor...@gmail.com> wrote:
> Hey all,
>
> I'm working on a small project that has a DSP component. A part of what it
> does is detect the frequency of recorded human voice (singing, not speech.
> One singer recorded to a single channel)
>
> The code is currently getting the correct note in the range 155-2100Hz
> about 80% of the time. It's currently just picking the frequency with the
> highest amplitude as the fundamental from each array of partials returned
> by an FFT (after phase-based post processing of the data the FFT returns).
> It identifies piano notes in the same range correctly 100% of the time.
>
> I have three questions:
>
> 1. generally speaking how accurately is it possible to detect notes from
> such a voice recording (on a scale of 1-100)? Is it possible to get close
> to 100% detection accuracy while maintaining a decent time resolution (say
> 100ms)?
>
> 2. what algorithm would be best for this? (given that performance is a
> concern, but not a huge concern)
>
> 3. given an FFT approach, how much difference would a peak picker that
> scans the distance between upper harmonics to get the fundamental make?
>
> Thanks,
>
> Paul

First, do yourself a BIG favor and forget about picking FFT partials.

For human singing the frequency range would be approx 80-1100 Hz -
a difference of about 13 times.
And, unlike piano, human voice has those things called "formants", and
to make matters worse, its fundamental frequency F0 (it's more
correct to talk about its inverse - fundamental period T0, instead)
is not constant but changes all the time (things like vibrato etc.)

As far as algorithms go...

Autocorrelation will do, AMDF will do... to some extent and with a lot
of tweaking

However, those methods are obsolete: the modern state of the art is
described in US Patent 7124075
( http://www.google.com/patents/about?id=dB97AAAAEBAJ&dq=7124075 )

Your real challenge will be in designing an algorithm which can adjust
its analysis window dynamically as fundamental frequency changes all
the time: ideally you need your window to cover at least 2 complete
periods with autocorrelation or AMDF (or even less than 2 complete
periods with 7124075 for a lot better time resolution), but not more
than 3-4 periods - otherwise your time resolution will be lost
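
To put numbers on the window advice, here is a sketch of an AMDF (average magnitude difference function) estimator together with a helper that sizes the analysis window in pitch periods. It illustrates only the autocorrelation/AMDF case mentioned above, not the patented method; the names window_samples and amdf_pitch and the 10% dip threshold are assumptions for the example. A caller would typically size the next window from the previous frame's estimate.

```python
import numpy as np

def window_samples(f0_hz, fs, periods=2.0):
    """Number of samples needed to cover `periods` pitch periods at f0_hz."""
    return int(np.ceil(periods * fs / f0_hz))

def amdf_pitch(frame, fs, fmin=80.0, fmax=1100.0, rel_threshold=0.1):
    """Estimate f0 (Hz) of one frame from the first deep dip of the AMDF."""
    x = np.asarray(frame, dtype=float)
    lag_min = max(1, int(fs / fmax))
    lag_max = min(int(fs / fmin), len(x) // 2)   # keep at least half the frame
    if lag_max < lag_min:
        raise ValueError("frame too short for the requested pitch range")
    lags = np.arange(lag_min, lag_max + 1)
    d = np.array([np.mean(np.abs(x[lag:] - x[:-lag])) for lag in lags])

    # First lag whose dip is nearly as deep as the deepest one, then walk
    # down to the bottom of that dip (avoids picking a multiple period).
    limit = d.min() + rel_threshold * (d.max() - d.min())
    i = int(np.argmax(d <= limit))
    while i + 1 < len(d) and d[i + 1] < d[i]:
        i += 1
    return fs / lags[i]

fs = 8000
longest = window_samples(80.0, fs, periods=2.0)   # 200 samples = 25 ms at 80 Hz
n = window_samples(200.0, fs, periods=3.0)        # 120 samples = 15 ms at 200 Hz
t = np.arange(n) / fs
frame = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
f0_amdf = amdf_pitch(frame, fs)
```

Because the lag search is capped at half the frame, a window sized for three periods of the previous estimate can only follow a downward pitch jump of up to about 1.5x; a real tracker would need a fallback to a longer window.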

And YES, you can get very close to 100% (if your singer is not
esophageal)

This is as far as free advice goes

Do not forget to tell us if your little project is commercially
successful :-)

Reply by Vladimir Vassilevsky ● December 4, 2009

fatalist wrote:

> However, those methods are obsolete: the modern state of the art is
> described in US Patent 7124075
> ( http://www.google.com/patents/about?id=dB97AAAAEBAJ&dq=7124075 )

Wow, we forgot of the ABSOLUTELY THE BEST ULTIMATE PITCH DETECTOR of 
Dmitry Teres.

Nobody knows what it is.
Nobody knows how it compares to other detectors.
Nobody uses it (except the author, may be).

Yet it is THE BEST ULTIMATE PITCH DETECTOR, AND YOU ABSOLUTELY HAVE TO 
USE IT.

> Do not forget to tell us if your little project is commercially
> successfull :-)

VLV

Reply by fatalist ● December 4, 2009

On Dec 4, 10:51 am, Vladimir Vassilevsky <nos...@nowhere.com> wrote:
> fatalist wrote:
> > However, those methods are obsolete: the modern state of the art is
> > described in US Patent 7124075
> > (http://www.google.com/patents/about?id=dB97AAAAEBAJ&dq=7124075)
>
> Wow, we forgot of the ABSOLUTELY THE BEST ULTIMATE PITCH DETECTOR of
> Dmitry Teres.
>
> Nobody knows what it is.
> Nobody knows how it compares to other detectors.
> Nobody uses it (except the author, may be).
>
> Yet it is THE BEST ULTIMATE PITCH DETECTOR, AND YOU ABSOLUTELY HAVE TO
> USE IT.
>
> > Do not forget to tell us if your little project is commercially
> > successfull :-)
>
> VLV

I guess some folks actually used it (if not commercially) and endorsed
it:

http://www.springerlink.com/content/g05x815817536777/

Haven't seen your contribution there :)

You need to work on your spelling, dude

I suggest to do Google look-ups before misspelling proper names
