Detecting speech obscured by loud stationary noise

Started by Silash, October 21, 2008
```Hi,

I would like to detect human speech in the presence of a very loud, very
stationary external sound.  It doesn't require a fine-grained detection but
that would be nice.  It will ultimately be implemented in an FPGA.  This
isn't quite a negative SNR scenario because the "external sound" isn't
Gaussian white noise, but a periodic waveform.  So far I've learned that:

1) Energy and power statistics aren't good for determining the presence or
absence of speech in noisy environments.  I may be able to watch the change
in those statistics and wait for "significant changes" but this still
requires some ballpark estimate of a threshold for "significant".

2) Watching for changes in the zero crossings statistic might provide
another insight, with a similar "define 'significant'" problem.

3) Maybe watch some kind of envelope statistics?  Number of peaks or
something?

4) I may end up doing an FFT anyway, so watching the change of certain
frequency bins is another option.

5) Some kind of correlation might do the trick.  The auto-correlation
should be high when no one is speaking and lower when someone is speaking.
I would prefer numerically cheap methods of calculating auto-correlation
(like Average [Magnitude/Square] Difference Function; I also believe
there's a way to do it with FFTs?).  I think this would require a rough
idea of the fundamental so that we could determine what a "low
auto-correlation" is.

6) I've read a little bit about Hidden Markov Models, but they seem rather
complicated and I've yet to find a source which explains them in terms that
I can grasp.  From what I gather, I could train an HMM to recognize the
external sound and from there it could tell me whether there is more than
just the external sound in the incoming signal.

Perhaps in the end, the best approach is a combination (calculate many
statistics and watch for a change in the majority of them).  I've read that
most of the energy in speech lies between a few hundred Hz and a few kHz, so
was considering a band-pass filter to make the statistics cleaner.

Are there any other suggestions?  Are there any pitfalls associated with
the ones above?  Can anyone explain in simpler terms how HMMs work?

Cheers,
Silash

```
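Points 1) and 2) above can be sketched as follows — a minimal Python/NumPy illustration (not from the thread), which assumes the first few frames are known to be speech-free so they can serve as the baseline for what counts as "significant":

```python
import numpy as np

def frame_stats(x, frame_len=256):
    """Short-term energy and zero-crossing rate for each frame of x."""
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)
    # Zero-crossing rate: fraction of adjacent sample pairs that change sign.
    signs = np.signbit(frames)
    zcr = np.mean(signs[:, 1:] != signs[:, :-1], axis=1)
    return energy, zcr

def significant_change(stat, n_baseline=10, k=3.0):
    """Flag frames whose statistic deviates more than k standard deviations
    from a baseline estimated on the first n_baseline frames.  Assumes those
    baseline frames contain only the external sound, no speech."""
    mu, sigma = stat[:n_baseline].mean(), stat[:n_baseline].std()
    return np.abs(stat - mu) > k * sigma + 1e-12
```

The multiplier `k` is the ballpark threshold the post worries about: larger values miss quiet speech, smaller values false-trigger on the external sound's own frame-to-frame variation.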
```On 21 Oct, 21:56, "Silash" <deadca...@gmail.com> wrote:
> Hi,
>
> I would like to detect human speech in the presence of a very loud, very
> stationary external sound.  It doesn't require a fine-grained detection but
> that would be nice.  It will ultimately be implemented in an FPGA.  This
> isn't quite a negative SNR scenario because the "external sound" isn't
> Gaussian white noise, but a periodic waveform.

What about a notch filter? Or a narrow band-pass filter which kills
off the noise? It seems you need to handle it somehow, so why not
filter it out? The speech signal in the noisy bandwidth is likely
lost anyway.

Rune
```
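Rune's suggestion can be sketched with a standard second-order IIR notch, using the Audio EQ Cookbook (RBJ) coefficient form — an illustrative implementation, not from the thread; `f0`, `fs`, and `q` are placeholder parameters:

```python
import numpy as np

def notch(x, f0, fs, q=30.0):
    """Second-order IIR notch at f0 Hz (RBJ audio-EQ-cookbook form)."""
    w0 = 2 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2 * q)
    b = np.array([1.0, -2 * np.cos(w0), 1.0])
    a = np.array([1 + alpha, -2 * np.cos(w0), 1 - alpha])
    b, a = b / a[0], a / a[0]          # normalize so a[0] == 1
    y = np.zeros_like(x)
    # Direct-form I difference equation:
    # y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
    for n in range(len(x)):
        y[n] = (b[0] * x[n]
                + (b[1] * x[n - 1] if n >= 1 else 0.0)
                + (b[2] * x[n - 2] if n >= 2 else 0.0)
                - (a[1] * y[n - 1] if n >= 1 else 0.0)
                - (a[2] * y[n - 2] if n >= 2 else 0.0))
    return y
```

The notch bandwidth is roughly f0/q, which is the catch the next post raises: a single narrow notch only helps if the interferer's energy is narrowband.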
```>On 21 Oct, 21:56, "Silash" <deadca...@gmail.com> wrote:
>> Hi,
>>
>> I would like to detect human speech in the presence of a very loud, very
>> stationary external sound.  It doesn't require a fine-grained detection but
>> that would be nice.  It will ultimately be implemented in an FPGA.  This
>> isn't quite a negative SNR scenario because the "external sound" isn't
>> Gaussian white noise, but a periodic waveform.
>
>What about a notch filter? Or a narrow band-pass filter which kills
>off the noise? It seems you need to handle it somehow, so why not
>filter it out? The speech signal in the noisy bandwidth is likely
>lost anyway.
>
>Rune
>

It's not really "noise" in the traditional sense (hence my potential
auto-correlation suggestion).  I should have added that the external sound
has a wide bandwidth, so a notch filter won't work.

The time domain signal is magnificently periodic, to the point where you
can copy a period of the external sound and subtract it from subsequent
frames to recover the speech signal.  Normally I would just test the result
of the subtraction but as the person moves around (i.e. gets closer to the
microphone) [I think!] the phase of some frequencies changes, ultimately
reducing the effectiveness of time-domain cancellation.  I imagine that
small changes in phase lead to slow (and small) changes in the statistics,
while a person speaking leads to rapid (and big) changes in statistics,
hence my approach of watching for "significant changes" in something like
zero crossings or peaks or short-term energy...

I would like to update the "noise" frame on-the-fly to use the most recent
speech-free frame, minimizing accumulated phase error and resulting in
optimal cancellation.  If I accidentally use a frame containing speech, the
subtracted speech adds an echo to the incoming signal.  The algorithm then
believes the person is always speaking, so the noise frame never updates
again and the output carries a persistent, repeating echo.

Cheers,
Silash
```
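The subtract-and-update scheme in the last post can be sketched as follows — a minimal illustration (not from the thread), assuming the period is known, the first period is speech-free, and a simple residual-energy threshold stands in for the speech detector:

```python
import numpy as np

def cancel_periodic(x, period, thresh):
    """Subtract a one-period template of the external sound from each frame,
    refreshing the template only from frames whose residual energy is below
    `thresh`, i.e. frames judged speech-free.  Gating the update this way
    keeps speech out of the template, avoiding the repeating-echo failure
    mode described above."""
    template = x[:period].copy()   # assumes the first period is speech-free
    out = np.zeros_like(x)         # first period is consumed as the template
    for start in range(period, len(x) - period + 1, period):
        frame = x[start:start + period]
        residual = frame - template
        out[start:start + period] = residual
        if np.mean(residual ** 2) < thresh:
            # No speech detected: refresh the template so it tracks the
            # slow phase drift caused by the talker moving around.
            template = frame.copy()
    return out
```

`thresh` plays the same role as the "significant change" threshold earlier in the thread: it separates the small residuals from phase drift from the large residuals that speech produces.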