Re: Issue implementing Energy threshold algorithm for Voice Activity Detection

Started by Jeff Brower June 1, 2011
Vineet-

> Thanks a lot for your suggestions. I will revise the algorithm and post my
> findings.
>
> I also found another major flaw in my code: my second for loop *for
> (int j = 0; j < wL; j++)* was not set up correctly, which is why I was getting
> the same value, as it was going over the same sample values over and over
> again.
>
> Basically, the second for loop should be something like this:
>
> *for (j = i; j < i + wL; j++)*

Ok that's a fix. What about a multiplier for your input term? One thing you can try is to feed your algorithm worst-case
input (i.e. a constant flat line at max amplitude) and make sure the output settles and decays as you expect.
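
The constant-input check can be sketched as below. This is only an illustration, assuming the one-pole form y[n] = a*x[n] + b*y[n-1] discussed earlier in the thread; the function names settleOnConstant and decayFrom are hypothetical and do not come from the original code.

```cpp
#include <cassert>
#include <cmath>

// One-pole smoother: y[n] = a*x[n] + b*y[n-1].
// Returns the output after feeding `count` samples of the constant value `x`,
// starting from y = 0 (hypothetical helper for the worst-case input test).
double settleOnConstant(double a, double b, double x, int count) {
    double y = 0.0;
    for (int n = 0; n < count; ++n) {
        y = a * x + b * y;
    }
    return y;
}

// Returns the output `count` samples after the input drops to zero,
// starting from the level `y0`. With x[n] = 0 only the b*y[n-1] term remains.
double decayFrom(double y0, double b, int count) {
    double y = y0;
    for (int n = 0; n < count; ++n) {
        y = b * y;
    }
    return y;
}
```

For a constant input the output settles at a/(1 - b) times the input, so a = 0.1, b = 0.9 tracks the input level, while the original update (effectively a = 1, b = 0.9) settles at ten times the input. Feeding 32767 * 32767 as x shows how large "total" can get in the worst case.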

> *Machine Information*
>
> System Type x64-based PC
> Processor Pentium(R) Dual-Core CPU T4400 @ 2.20GHz, 2200 MHz, 2
> Core(s), 2 Logical Processor(s)
> Installed Physical Memory (RAM) 4.00 GB
>
> I have never used OpenMP but have experience with MPI. I will work on this
> and update the post.

On your current machine, you should see close to a 2x performance increase when you parallelize your loop with OpenMP. With a
quad-core (or higher) machine, the performance increase will be proportionally higher. For compute-intensive applications,
the number of physical cores makes the difference; hyperthreading isn't going to help much.
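
A rough sketch of what the parallel loop could look like, assuming the energy of each frame is computed independently (plain per-frame RMS, no running sum carried from one frame to the next). The function name frameRms and the use of std::vector are assumptions made for illustration, not the original code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// RMS of each non-overlapping frame of `frameLen` samples.
// The last frame may be shorter when the sample count is not a multiple
// of frameLen.
std::vector<double> frameRms(const std::vector<std::int16_t>& samples,
                             std::size_t frameLen) {
    if (frameLen == 0) return {};  // nothing sensible to compute
    const long long numFrames =
        static_cast<long long>((samples.size() + frameLen - 1) / frameLen);
    std::vector<double> rms(static_cast<std::size_t>(numFrames), 0.0);

    // Frames do not depend on each other, so threads can split them up.
    // If OpenMP is not enabled, the pragma is ignored and the loop runs
    // serially with the same result.
    #pragma omp parallel for schedule(static)
    for (long long f = 0; f < numFrames; ++f) {
        const std::size_t begin = static_cast<std::size_t>(f) * frameLen;
        const std::size_t end = std::min(begin + frameLen, samples.size());
        double sumSq = 0.0;  // private to this iteration, reset per frame
        for (std::size_t j = begin; j < end; ++j) {  // exclusive upper bound
            const double x = samples[j];
            sumSq += x * x;
        }
        rms[static_cast<std::size_t>(f)] =
            std::sqrt(sumSq / static_cast<double>(end - begin));
    }
    return rms;
}
```

With GCC or Clang this is built with -fopenmp (MSVC uses /openmp). Note that a recursion such as y[n] = a*x[n] + b*y[n-1] running across all samples has a loop-carried dependency and cannot be split this way; only the frame-level loop is parallel here.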

-Jeff

> On Wed, Jun 1, 2011 at 9:34 AM, Jeff Brower wrote:
>
>> Vineet-
>>
>> > I am trying to implement the energy threshold algorithm for
>> > voice activity detection and not getting meaningful values
>> > for energy for frames of size wL.
>> >
>> > wL = 1784 // about 40 ms
>> > const double decay_constant = 0.90 // some optimal value between 0 and 1
>> > double prevrms = 1.0 // avoid DivideByZero
>> > double threshold = some optimal value after some experimentation
>> >
>> > for (int i = 0; i < noSamples ; i += wL)
>> > {
>> > for (int j = 0; j < wL; j++)
>> > {
>> > // Exponential decay
>> > total = total * decay_constant;
>> > total += (audioSample[j] * audioSample[j]); // sum of squares
>> > }
>> >
>> > double mean = total / wL;
>> > double rms = Math.Round(Math.Sqrt(mean),2); // root mean square
>> > double prevrms = 1.0;
>> >
>> > if(rms/prevrms > threshold)
>> > {
>> > // voice detected
>> > }
>> >
>> > prevrms = rms;
>> > rms = 0.0;
>> > }
>> >
>> > Please advise what is wrong with the above implementation
>> > as rms computed for every frame is calculated as 0.19.
>>
>> I don't know which specific algorithm you're trying to implement, but just
>> guessing it may be this:
>>
>> y[n] = a*x[n] + b*y[n-1]
>>
>> where a + b = 1. That will give you an exponential decay. In your case,
>> you may want x[n] to be abs(x[n]) or
>> sqr(x[n]), and try a = 0.1 and b = 0.9.
>>
>> Your code looks similar, but you have no coefficient for your input term,
>> which leads me to guess that "total" in your
>> code will not be scaled properly. The recursion is stable for |b| < 1, but unless
>> a + b = 1 the output settles at a/(1 - b) times the input level (10x for a = 1, b = 0.9).
>>
>> > The other issue is speed, as it took about 30 minutes to
>> > execute the above. Currently implemented as O(n^2). Working
>> > with offline data so not a big deal, as achieving
>> > accuracy is the main objective, but any suggestions to improve
>> > efficiency would be highly appreciated.
>>
>> 30 minutes for approx 248 bil multiplies... well, could be. What type of
>> machine are you using? Your loop can be
>> parallelized -- did you try OpenMP?
>>
>> -Jeff
>>
>> > Also, would you recommend using other factors like
>> > auto-correlation or zero-crossing rate, or would energy alone be
>> > sufficient?
>> >
>> > Following is the summary of the WAV file (only considering
>> > clean conversational speech) I am using:
>> >
>> > // WAV file information
>> > Sampling Frequency: 44100 Bits Per Sample: 16
>> > Channels: 2 nBlockAlign: 4 wavdata size: 557941248 bytes
>> > Duration: 3162.932 sec Samples: 139485312 Time between samples: 0.0227 ms
>> > Byte position at start of samples: 44 bytes (0x2C)
>> >
>> > Chosen first sample to display: 1 (0.000 ms)
>> > Chosen end sample to display: 1784 (40.431 ms)
>> >
>> > 16 bit max possible value is: 32767 (0x7FFF)
>> > 16 bit min possible value is: -32768 (0x8000)
>> >
>> > Regards,
>> >
>> > Vineet
>>
>
Thanks Jeff

I corrected my code and ran a series of experiments, but I am having trouble
analyzing the data to find optimal values for the window size, decay constant,
and threshold for determining voice in and out. I am only
working with clean mono audio. The experiments done are as follows:

frame size = 441 (10ms) [for decay constant 0.1 to 0.9]
frame size = 882 (20ms) [for decay constant 0.1 to 0.9]
frame size = 1323 (30ms) [for decay constant 0.1 to 0.9]
frame size = 1764 (40ms) [for decay constant 0.1 to 0.9]
frame size = 2205 (50ms) [for decay constant 0.1 to 0.9]

I have manually annotated the audio file with sample numbers for voice
in/out. Framesin.txt contains the sample numbers for voice in and
Framesout.txt contains the sample numbers for voice out.

I have uploaded a sample raw data file with frame size = 441 (10 ms), decay
constant = 0.1:
http://www.fileserve.com/file/GYTPZ5C

The following piece of code determines voice in/out
(tot = RMS ratio; prevtot = previous RMS ratio):

if (statein == false && tot > 1.3 * prevtot && tot < 2.3 * prevtot)
{
    statein = true;
    // Voice in
}
if (statein == true && prevtot > 1.3 * tot && prevtot < 2.3 * tot)
{
    // Voice out
    statein = false;
}
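
For trying different thresholds against the annotated frames, the rule above could be wrapped in a function that returns every state change. This is only a sketch: the names Transition and findTransitions and the lo/hi parameters are hypothetical, with defaults taken from the 1.3 and 2.3 factors above, and it assumes the per-frame values are non-negative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Transition {
    std::size_t frame;  // index of the frame where the state changed
    bool voiceIn;       // true = voice in, false = voice out
};

// Applies the in/out rule to a sequence of per-frame values.
// `lo` and `hi` bound the accepted frame-to-frame ratio.
std::vector<Transition> findTransitions(const std::vector<double>& tot,
                                        double lo = 1.3, double hi = 2.3) {
    std::vector<Transition> out;
    bool stateIn = false;
    for (std::size_t k = 1; k < tot.size(); ++k) {
        const double cur = tot[k];
        const double prev = tot[k - 1];
        if (!stateIn && cur > lo * prev && cur < hi * prev) {
            stateIn = true;   // rise that falls inside the ratio band
            out.push_back({k, true});
        } else if (stateIn && prev > lo * cur && prev < hi * cur) {
            stateIn = false;  // fall that falls inside the ratio band
            out.push_back({k, false});
        }
        // For non-negative values both tests cannot pass on the same frame,
        // so else-if behaves the same as two separate if statements.
    }
    return out;
}
```

One property worth checking against the annotations: a jump larger than hi times the previous value (for example 1.0 followed by 3.0) changes no state under this rule, so very abrupt onsets are skipped.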

Let me know if you need more data files. Also, are there any free clean
speech databases (corpora) available for comparison purposes? The only ones
I could find are noise-specific (NOISEX, TIMIT, etc.), which I am not
interested in.

Regards,

Vineet
