Hi all

I am trying to implement the energy threshold algorithm for voice activity
detection but am not getting meaningful energy values for frames of size wL.

wL = 1784 // about 40 ms

const double decay_constant = 0.90 // some optimal value between 0 and 1

double prevrms = 1.0 // avoid DivideByZero

double threshold = some optimal value after some experimentation

for (int i = 0; i < noSamples ; i += wL)

{

for (int j = 0; j < wL; j++)

{

// Exponential decay

total = total * decay_constant;

total += (audioSample[j] * audioSample[j]); // sum of squares

}

double mean = total / wL;

double rms = Math.Round(Math.Sqrt(mean),2); // root mean square

double prevrms = 1.0;

if(rms/prevrms > threshold)

{

// voice detected

}

prevrms = rms;

rms = 0.0;

}

Please advise what is wrong with the above implementation, as the rms computed
for every frame comes out as 0.19.

The other issue is speed, as it took about 30 minutes to execute the above.
It is currently implemented as O(n^2). I am working with offline data, so this
is not a big deal as accuracy is the main objective, but any suggestions to
improve efficiency would be highly appreciated.

Also, would you recommend using other features such as auto-correlation or
zero-crossing rate, or would energy alone be sufficient?

Following is the summary of the WAV file (only considering clean conversational
speech) I am using:

// WAV file information

Sampling Frequency: 44100 Bits Per Sample: 16

Channels: 2 nBlockAlign: 4 wavdata size: 557941248 bytes

Duration: 3162.932 sec Samples: 139485312 Time between samples: 0.0227 ms

Byte position at start of samples: 44 bytes (0x2C)

Chosen first sample to display: 1 (0.000 ms)

Chosen end sample to display: 1784 (40.431 ms)

16 bit max possible value is: 32767 (0x7FFF)

16 bit min possible value is: -32768 (0x8000)

Regards,

Vineet

# Issue implementing Energy threshold algorithm for Voice Activity Detection

Started by ●May 31, 2011

Reply by ●May 31, 2011

Vineet-

> I am trying to implement the energy threshold algorithm for

> voice activity detection and not getting meaningful values

> for energy for frames of size wL.

>

> wL = 1784 // about 40 ms

> const double decay_constant = 0.90 // some optimal value

> between 0 and 1

> double prevrms = 1.0 // avoid DivideByZero

> double threshold = some optimal value after some experimentation

>

> for (int i = 0; i < noSamples ; i += wL)

> {

> for (int j = 0; j < wL; j++)

> {

> // Exponential decay

> total = total * decay_constant;

> total += (audioSample[j] * audioSample[j]); // sum of squares

> }

>

> double mean = total / wL;

> double rms = Math.Round(Math.Sqrt(mean),2); // root mean sqare

> double prevrms = 1.0;

>

> if(rms/prevrms > threshold)

> {

> // voice detected

> }

>

> prevrms = rms;

> rms = 0.0;

> }

>

> Please advise what is wrong with the above implementation

> as rms computed for every frame is calculated as 0.19.

I don't know which specific algorithm you're trying to implement, but just guessing it may be this:

y[n] = a*x[n] + b*y[n-1]

where a + b = 1. That will give you an exponential decay. In your case, you may want x[n] to be abs(x[n]) or

sqr(x[n]), and try a = 0.1 and b = 0.9.

Your code looks similar, but you have no coefficient for your input term, which leads me to guess that "total" in your

code will not decay, or at least not properly. Unless a + b = 1, I believe you have an unstable situation.
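A minimal sketch of the recursion y[n] = a*x[n] + b*y[n-1] may help show the effect of the input coefficient. The names `OnePole` and `Smooth` are illustrative and do not come from the posted code, and y[-1] is assumed to be 0. For a constant input the output settles at a/(1 - b) times that input (for |b| < 1), so a = 0.1, b = 0.9 tracks the input level, while a = 1, b = 0.9 (no input coefficient) settles at ten times the input.

```csharp
using System;

static class OnePole
{
    // Computes y[n] = a*x[n] + b*y[n-1] for every n, taking y[-1] = 0.
    public static double[] Smooth(double[] x, double a, double b)
    {
        var y = new double[x.Length];
        double prev = 0.0;
        for (int n = 0; n < x.Length; n++)
        {
            prev = a * x[n] + b * prev;   // new output from input and previous output
            y[n] = prev;
        }
        return y;
    }
}
```

Calling `OnePole.Smooth(x, 0.1, 0.9)` on squared samples gives a running energy estimate on the same scale as the input; the choice of 0.1 and 0.9 is the starting point suggested above and would still need tuning.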

> The other issue is speed as it took about 30 minutes to

> execute the above. Currently implemented as O(n2). Working

> with offline data so not a big deal as achieving a

> accuracy is the main objective ut any suggestions to improve

> efficiency would be highly appreciated.

30 minutes for approx. 248 billion multiplies... well, could be. What type of machine are you using? Your loop can be

parallelized -- did you try OpenMP?
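OpenMP targets C, C++ and Fortran; since the posted code looks like C#, a roughly comparable option there is `Parallel.For` from `System.Threading.Tasks`. The sketch below is only an illustration under assumptions: samples are mono and already converted to `double`, frames do not overlap, and the names `ParallelEnergy` and `FrameEnergies` are hypothetical. Each frame's energy is independent of the others, so frames can be processed concurrently; any frame-to-frame smoothing would still have to run sequentially afterwards.

```csharp
using System;
using System.Threading.Tasks;

static class ParallelEnergy
{
    // Mean-square energy of each non-overlapping frame of length wL.
    // A trailing partial frame is dropped.
    public static double[] FrameEnergies(double[] samples, int wL)
    {
        int frames = samples.Length / wL;
        var energy = new double[frames];

        // Each iteration writes only energy[f], so no locking is needed.
        Parallel.For(0, frames, f =>
        {
            double total = 0.0;
            int start = f * wL;
            for (int j = start; j < start + wL; j++)
                total += samples[j] * samples[j];
            energy[f] = total / wL;
        });

        return energy;
    }
}
```

Whether this helps in practice depends on the machine; a single pass over the samples is linear in their number, so fixing the loop structure is likely to matter more than parallelism.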

-Jeff

> Also, would you recommend using other factors like

> auto-correlation, zero-crossing rate or energy alone be

> sufficient.

>

> Following is the summary of the WAV file (only considering

> clean conversational speech) i am using:

>

> // WAV file information

> Sampling Frequency: 44100 Bits Per Sample: 16

> Channels: 2 nBlockAlign: 4 wavdata size: 557941248 bytes

> Duration: 3162.932 sec Samples: 139485312 Time between samples: 0.0227 ms

> Byte position at start of samples: 44 bytes (0x2C)

>

> Chosen first sample to display: 1 (0.000 ms)

> Chosen end sample to display: 1784 (40.431 ms)

>

> 16 bit max possible value is: 32767 (0x7FFF)

> 16 bit min possible value is: -32768 (0x8000)

>

> Regards,

>

> Vineet

Reply by ●June 3, 2011

I made a couple of edits to your code. This is what I would have done.

1.) Compute the energy of the signal in a given frame.

2.) Smooth that energy from frame to frame.

-Brant
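The two steps listed above can be sketched roughly as follows. This is an illustration under assumptions rather than the exact edited code: samples are taken to be mono and already converted to `double`, frames do not overlap, and the names `FrameVad`, `FrameEnergies` and `SmoothEnergies` are hypothetical.

```csharp
using System;

static class FrameVad
{
    // Step 1: mean-square energy of each non-overlapping frame of length wL.
    // A trailing partial frame is dropped.
    public static double[] FrameEnergies(double[] samples, int wL)
    {
        int frames = samples.Length / wL;
        var energy = new double[frames];
        for (int f = 0; f < frames; f++)
        {
            double total = 0.0;               // reset for every frame
            int start = f * wL;
            for (int j = start; j < start + wL; j++)
                total += samples[j] * samples[j];
            energy[f] = total / wL;
        }
        return energy;
    }

    // Step 2: exponential smoothing of the energy from frame to frame,
    // starting from an accumulator of 0.
    public static double[] SmoothEnergies(double[] energy, double decay)
    {
        var smoothed = new double[energy.Length];
        double acc = 0.0;
        for (int f = 0; f < energy.Length; f++)
        {
            acc = decay * acc + (1.0 - decay) * energy[f];
            smoothed[f] = acc;
        }
        return smoothed;
    }
}
```

Taking `Math.Sqrt` of each smoothed value gives an RMS figure that can then be compared against a threshold chosen by experiment.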

On Sun, May 22, 2011 at 6:39 PM, wrote:

> Hi all

>

> I am trying to implement the energy threshold algorithm for voice activity

> detection and not getting meaningful values for energy for frames of size

> wL.

>

> wL = 1784 // about 40 ms

> const double decay_constant = 0.90 // some optimal value between 0 and 1

> double prevrms = 1.0 // avoid DivideByZero

> double threshold = some optimal value after some experimentation

>

> totalEnergy = 0;

> for (int i = 0; i < noSamples ; i += wL)

> {

>

total = 0;

> for (int j = 0; j < wL; j++)

> {

> // Exponential decay

> total = total;

> total += (audioSample[j] * audioSample[j]); // sum of squares

> }

> totalEnergy = decay_constant * totalEnergy + (1 - decay_constant) * total;

> double mean = totalEnergy / wL;

> double rms = Math.Round(Math.Sqrt(mean),2); // root mean sqare

> double prevrms = 1.0;

>

> if(rms/prevrms > threshold)

> {

> // voice detected

> }

>

> prevrms = rms;

> rms = 0.0;

> }

>

> Please advise what is wrong with the above implementation as rms computed

> for every frame is calculated as 0.19.

>

> The other issue is speed as it took about 30 minutes to execute the above.

> Currently implemented as O(n2). Working with offline data so not a big deal

> as achieving a accuracy is the main objective ut any suggestions to improve

> efficiency would be highly appreciated.

>

> Also, would you recommend using other factors like auto-correlation,

> zero-crossing rate or energy alone be sufficient.

>

> Following is the summary of the WAV file (only considering clean

> conversational speech) i am using:

>

> // WAV file information

> Sampling Frequency: 44100 Bits Per Sample: 16

> Channels: 2 nBlockAlign: 4 wavdata size: 557941248 bytes

> Duration: 3162.932 sec Samples: 139485312 Time between samples: 0.0227 ms

> Byte position at start of samples: 44 bytes (0x2C)

>

> Chosen first sample to display: 1 (0.000 ms)

> Chosen end sample to display: 1784 (40.431 ms)

>

> 16 bit max possible value is: 32767 (0x7FFF)

> 16 bit min possible value is: -32768 (0x8000)

>

> Regards,

>

> Vineet

>

>

--

Brant Jameson

PhD Candidate

UC Santa Cruz Computer Engineering

http://people.ucsc.edu/~pheese

Reply by ●June 3, 2011

Hi Jeff/Brant

Thanks a lot for your suggestions. I will revise the algorithm and post my

findings.

I also found another major flaw in my code: my second for loop,

for (int j = 0; j < wL; j++), was not set up correctly, which is why I was

getting the same value; it was going over the same sample values over and

over again.

Basically, the second for loop should be something like this:

for (j = i; j < i + wL; j++)
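As a rough sketch of the corrected indexing, the inner loop can run over the half-open range from `start` up to but not including `start + wL`, which covers exactly wL samples; using `<=` would take wL + 1 samples and reuse the first sample of the next frame. The helper below is hypothetical (`Framing` and `FrameRms` are not from the posted code) and assumes mono samples as `double` and a non-negative `start`. It also clamps the range so a short final frame does not read past the end of the array.

```csharp
using System;

static class Framing
{
    // RMS of the frame covering samples[start .. start + wL), clamped to the
    // end of the buffer. Returns 0 for an empty frame.
    public static double FrameRms(double[] samples, int start, int wL)
    {
        int end = Math.Min(start + wL, samples.Length);
        if (end <= start) return 0.0;
        double total = 0.0;
        for (int j = start; j < end; j++)
            total += samples[j] * samples[j];
        return Math.Sqrt(total / (end - start));
    }
}
```

In the outer loop this would be called as `Framing.FrameRms(audioSample, i, wL)` for `i = 0, wL, 2*wL, ...`.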

*Machine Information*

System Type: x64-based PC

Processor: Pentium(R) Dual-Core CPU T4400 @ 2.20GHz, 2200 MHz, 2 Core(s), 2 Logical Processor(s)

Installed Physical Memory (RAM): 4.00 GB

I have never used OpenMP but have some experience with MPI. I will work on this

and update the post.

Regards

Vineet

On Wed, Jun 1, 2011 at 9:34 AM, Jeff Brower wrote:

> Vineet-

>

> > I am trying to implement the energy threshold algorithm for

> > voice activity detection and not getting meaningful values

> > for energy for frames of size wL.

> >

> > wL = 1784 // about 40 ms

> > const double decay_constant = 0.90 // some optimal value

> > between 0 and 1

> > double prevrms = 1.0 // avoid DivideByZero

> > double threshold = some optimal value after some experimentation

> >

> > for (int i = 0; i < noSamples ; i += wL)

> > {

> > for (int j = 0; j < wL; j++)

> > {

> > // Exponential decay

> > total = total * decay_constant;

> > total += (audioSample[j] * audioSample[j]); // sum of squares

> > }

> >

> > double mean = total / wL;

> > double rms = Math.Round(Math.Sqrt(mean),2); // root mean sqare

> > double prevrms = 1.0;

> >

> > if(rms/prevrms > threshold)

> > {

> > // voice detected

> > }

> >

> > prevrms = rms;

> > rms = 0.0;

> > }

> >

> > Please advise what is wrong with the above implementation

> > as rms computed for every frame is calculated as 0.19.

>

> I don't know which specific algorithm you're trying to implement, but just

> guessing it may be this:

>

> y[n] = a*x[n] + b*y[n-1]

>

> where a + b = 1. That will give you an exponential decay. In your case,

> you may want x[n] to be abs(x[n]) or

> sqr(x[n]), and try a = 0.1 and b = 0.9.

>

> Your code looks similar, but you have no coefficient for your input term,

> which leads me to guess that "total" in your

> code will not decay, or at least not properly. Unless a + b = 1, then I

> believe you have an unstable situation.

>

> > The other issue is speed as it took about 30 minutes to

> > execute the above. Currently implemented as O(n2). Working

> > with offline data so not a big deal as achieving a

> > accuracy is the main objective ut any suggestions to improve

> > efficiency would be highly appreciated.

>

> 30 minutes for approx 248 bil multiplies... well, could be. What type of

> machine are you using? Your loop can be

> parallelized -- did you try OpenMP?

>

> -Jeff

>

> > Also, would you recommend using other factors like

> > auto-correlation, zero-crossing rate or energy alone be

> > sufficient.

> >

> > Following is the summary of the WAV file (only considering

> > clean conversational speech) i am using:

> >

> > // WAV file information

> > Sampling Frequency: 44100 Bits Per Sample: 16

> > Channels: 2 nBlockAlign: 4 wavdata size: 557941248 bytes

> > Duration: 3162.932 sec Samples: 139485312 Time between samples:

> 0.0227 ms

> > Byte position at start of samples: 44 bytes (0x2C)

> >

> > Chosen first sample to display: 1 (0.000 ms)

> > Chosen end sample to display: 1784 (40.431 ms)

> >

> > 16 bit max possible value is: 32767 (0x7FFF)

> > 16 bit min possible value is: -32768 (0x8000)

> >

> > Regards,

> >

> > Vineet
