
Detecting speech obscured by loud stationary noise

Started by Silash October 21, 2008
Hi,

I would like to detect human speech in the presence of a very loud, very
stationary external sound.  It doesn't require a fine-grained detection but
that would be nice.  It will ultimately be implemented in an FPGA.  This
isn't quite a negative SNR scenario because the "external sound" isn't
Gaussian white noise, but a periodic waveform.  So far I've learned that:

1) Energy and power statistics aren't good for determining the presence or
absence of speech in noisy environments.  I may be able to watch for
changes in those statistics and wait for "significant changes", but this
still requires some ballpark estimate of a threshold for "significant".
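A minimal sketch of what I mean by 1), where the detection ratio and the
smoothing constant are made-up placeholders I'd have to tune on real data:

```python
def short_term_energy(frame):
    """Sum of squared samples in one frame."""
    return sum(x * x for x in frame)

def energy_detector(frames, ratio=2.0, alpha=0.95):
    """Flag frames whose energy exceeds `ratio` times a slowly
    adapting noise-floor estimate.  `ratio` and `alpha` are
    placeholder constants, not tuned values."""
    floor = None
    flags = []
    for frame in frames:
        e = short_term_energy(frame)
        if floor is None:
            floor = e  # bootstrap the floor from the first frame
        flags.append(e > ratio * floor)
        if not flags[-1]:
            # only adapt the floor on frames judged speech-free
            floor = alpha * floor + (1 - alpha) * e
    return flags
```

The floor only adapts on frames judged speech-free, which is the same
chicken-and-egg problem I describe for the noise frame below.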

2) Watching for changes in the zero crossings statistic might provide
another insight, with a similar "define 'significant'" problem.
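The statistic itself in 2) is at least cheap to compute, something like:

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:])
        if (a >= 0.0) != (b >= 0.0)
    )
    return crossings / (len(frame) - 1)
```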

3) Maybe watch some kind of envelope statistics?  Number of peaks or
something?
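For 3) I was picturing something like a rectify-and-smooth envelope
follower plus a peak count; again the constants are placeholders:

```python
def envelope(frame, alpha=0.99):
    """One-pole rectify-and-decay envelope follower; `alpha`
    is a placeholder decay constant."""
    env, out = 0.0, []
    for x in frame:
        env = max(abs(x), alpha * env)
        out.append(env)
    return out

def count_peaks(env, min_height=0.0):
    """Count local maxima in the envelope above `min_height`."""
    return sum(
        1 for a, b, c in zip(env, env[1:], env[2:])
        if b > a and b >= c and b > min_height
    )
```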

4) I may end up doing an FFT anyway, so watching the change of certain
frequency bins is another option.

5) Some kind of correlation might do the trick.  The auto-correlation
should be high when no one is speaking and lower when someone is speaking. 
I would prefer numerically cheap methods of calculating auto-correlation
(like Average [Magnitude/Square] Difference Function; I also believe
there's a way to do it with FFTs?).  I think this would require a rough
idea of the fundamental so that we could determine what a "low
auto-correlation" is.

6) I've read a little bit about Hidden Markov Models, but they seem rather
complicated and I've yet to find a source which explains it in terms that I
can grasp.  From what I gather, I could train an HMM to recognize the
external sound and from there it could tell me whether there is more than
just the external sound in the incoming signal.

Perhaps in the end, the best approach is a combination (calculate many
statistics and watch for a change in the majority of them).  I've read that
most of the energy in speech lies between a few hundred Hz and a few kHz,
so I was considering a band-pass filter to make the statistics cleaner.

Are there any other suggestions?  Are there any pitfalls associated with
the ones above?  Can anyone explain in simpler terms how HMM works?

Cheers,
Silash


On 21 Okt, 21:56, "Silash" <deadca...@gmail.com> wrote:
> Hi,
>
> I would like to detect human speech in the presence of a very loud, very
> stationary external sound.  It doesn't require a fine-grained detection
> but that would be nice.  It will ultimately be implemented in an FPGA.
> This isn't quite a negative SNR scenario because the "external sound"
> isn't Gaussian white noise, but a periodic waveform.
What about a notch filter? Or a narrow band-pass filter which kills off
the noise? It seems you need to handle it somehow, so why not filter it
out? The speech signal in the noisy bandwidth is likely lost anyway.

Rune
> On 21 Okt, 21:56, "Silash" <deadca...@gmail.com> wrote:
>> Hi,
>>
>> I would like to detect human speech in the presence of a very loud, very
>> stationary external sound.  It doesn't require a fine-grained detection
>> but that would be nice.  It will ultimately be implemented in an FPGA.
>> This isn't quite a negative SNR scenario because the "external sound"
>> isn't Gaussian white noise, but a periodic waveform.
>
> What about a notch filter? Or a narrow band-pass filter which kills
> off the noise? It seems you need to handle it somehow, so why not
> filter it out? The speech signal in the noisy bandwidth is likely
> lost anyway.
>
> Rune
It's not really "noise" in the traditional sense (hence my potential
auto-correlation suggestion).  I should have added that the external sound
has a wide bandwidth, so a notch filter won't work.

The time-domain signal is magnificently periodic, to the point where you
can copy a period of the external sound and subtract it from subsequent
frames to recover the speech signal.  Normally I would just test the result
of the subtraction, but as the person moves around (i.e. gets closer to the
microphone) [I think!] the phase of some frequencies changes, ultimately
reducing the effectiveness of time-domain cancellation.

I imagine that small changes in phase lead to slow (and small) changes in
the statistics, while a person speaking leads to rapid (and big) changes in
statistics, hence my approach of watching for "significant changes" in
something like zero crossings, peaks, or short-term energy.

I would like to update the "noise" frame on the fly to use the most recent
speech-free frame, minimizing accumulated phase error and giving optimal
cancellation.  The catch: if I use a frame containing speech, the
subtracted speech adds an echo to the incoming speech, leading the
algorithm to believe that the person is always speaking; the noise frame
then never updates, and the output carries a persistent, repeating echo.

Cheers,
Silash
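P.S. To make the update rule concrete, a toy sketch of the
subtract-and-refresh loop I have in mind; `is_speech` stands in for
whichever detection statistic ends up working, and all names here are
illustrative:

```python
def cancel_noise(frames, template, is_speech):
    """Subtract a stored noise-period `template` from each frame,
    refreshing the template from the most recent frame judged
    speech-free so phase drift doesn't accumulate.  `is_speech`
    is a caller-supplied detector; if it wrongly passes a speech
    frame, that speech gets baked into the template as an echo."""
    out = []
    for frame in frames:
        residual = [x - t for x, t in zip(frame, template)]
        out.append(residual)
        if not is_speech(residual):
            template = frame  # safe to refresh: no speech to bake in
    return out
```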