Forums

Voice Activity Detection (VAD)

Started by Tony Robinson September 21, 2007
For a couple of reasons my attention has been drawn to voice activity
detection (speech/non-speech detection).

The speechies (and I count myself in that camp) tend to use tools that
speechies know, and do something like train up a two state HMM with
mixture Gaussian densities for each state and do a Viterbi decode to
decide what is speech and what is not.
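The two-state-HMM-plus-Viterbi approach described above can be sketched in a few lines. This is a toy illustration, not anyone's trained system: single Gaussian emissions on frame log-energy (rather than full mixture densities), and the means, variance and transition probabilities are made-up values for demonstration.

```python
# Toy two-state HMM VAD: state 0 = noise, state 1 = speech, Gaussian
# emission densities on frame log-energy, Viterbi decode.  All
# parameters below are illustrative assumptions, not trained values.
import math

def viterbi_vad(log_energy, means=(-2.0, 2.0), var=1.0, stay=0.95):
    """Label each frame 0 (noise) or 1 (speech) via a Viterbi decode."""
    def loglik(x, mu):
        return -0.5 * ((x - mu) ** 2 / var + math.log(2 * math.pi * var))

    log_stay, log_switch = math.log(stay), math.log(1.0 - stay)
    # Start with equal state priors.
    score = [loglik(log_energy[0], means[0]), loglik(log_energy[0], means[1])]
    back = []
    for x in log_energy[1:]:
        new, ptr = [0.0, 0.0], [0, 0]
        for s in (0, 1):
            cand = [score[0] + (log_stay if s == 0 else log_switch),
                    score[1] + (log_switch if s == 0 else log_stay)]
            ptr[s] = 0 if cand[0] >= cand[1] else 1
            new[s] = cand[ptr[s]] + loglik(x, means[s])
        score, back = new, back + [ptr]
    # Backtrace from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

With well-separated energies the decode tracks the energy contour while the sticky transitions (stay = 0.95) smooth out single-frame glitches.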

However, that assumes that the noise seen in future is modelled by the
noise seen in training, and I'd like to get away from that as much as I
can.

http://en.wikipedia.org/wiki/Voice_activity_detection gives some
references, one of which is online and I would guess shares authorship
with the wikipedia article.

In my mind, what characterises speech over background noise is that
there are voiced portions which have significantly more power than the
background.  I'm happy with this assumption as I always expect the
speech power to be greater than the noise power during voiced speech and
I don't expect the noise to be periodic (i.e. I'm happy that it doesn't
work with siren noise).

So I feel that a good voice activity detector would be based on a voiced
speech detector.  The boundaries of the speech would be extended from
the voiced periods by some means, even if this is as simple as extending
by 100ms I think it would cover a lot of cases.
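The simple boundary-extension idea above can be sketched as follows, assuming some upstream voiced-frame detector has already produced a per-frame mask. The frame rate and hangover length are illustrative assumptions (100 ms = 10 frames at a 10 ms frame step).

```python
# Extend speech boundaries outward from detected voiced regions by a
# fixed hangover on both sides, per the suggestion above.  The voiced
# mask is assumed to come from some separate voiced-speech detector.

def extend_voiced(voiced, hangover=10):
    """Mark a frame as speech if any voiced frame lies within `hangover`."""
    n = len(voiced)
    speech = [False] * n
    for i, v in enumerate(voiced):
        if v:
            for j in range(max(0, i - hangover), min(n, i + hangover + 1)):
                speech[j] = True
    return speech
```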

So, my questions to the comp.speech.research and comp.dsp camps are:

1) what algorithms are state of the art in voice activity detection?
2) has the voiced-speech detection approach been published?

It all seems like simple stuff, yet what little literature there is
tends to be somewhat heuristic and I have my doubts about it working
over a range of speech and noise conditions.


Tony



 

Tony Robinson wrote:


> So, my questions to the comp.speech.research and comp.dsp camps are:
>
> 1) what algorithms are state of the art in voice activity detection?
> 2) has the voiced-speech detection approach been published?
>
> It all seems like simple stuff, yet what little literature there is
> tends to be somewhat heuristic and I have my doubts about it working
> over a range of speech and noise conditions.
The key question is how fast the VAD response should be, or whether the
processing has to run in real time or you are processing recorded
speech. It is not a problem to distinguish speech from non-speech
signals if you can use a sliding observation window of several seconds;
that allows accumulating enough statistics. There is not much you can do
if the voice/non-voice decision has to be made in 10 ms.

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com
Vladimir Vassilevsky wrote:
> Tony Robinson wrote:
>
>> So, my questions to the comp.speech.research and comp.dsp camps are:
>>
>> 1) what algorithms are state of the art in voice activity detection?
>> 2) has the voiced-speech detection approach been published?
>>
>> It all seems like simple stuff, yet what little literature there is
>> tends to be somewhat heuristic and I have my doubts about it working
>> over a range of speech and noise conditions.
>
> The key question is how fast should be the VAD response. Or if the
> processing has to be in the real time or you are processing the
> recorded speech. It is not a problem to distinguish a speech from non
> speech signals if you can have a sliding observation window of several
> seconds. That allows accumulating for the enough statistics. There is
> not much you can do if the voice/non-voice decision has to be made in
> 10 msec.
Or put another way: you can do a really good job in a speech recognition
system, but doing it in speech codecs like G.729 is tough. They recently
changed the VAD in G.729 to try to improve it. It still has problems,
but I guess that is as good as anyone gets right now with low latency.

Regards,
Steve
Steve Underwood <steveu@dis.org> writes:

> They recently changed the VAD in G.729 to try to improve it.
Where is G.729 used? VOIP?

--
Randy Yates, Fuquay-Varina, NC, 919-577-9882, <yates@ieee.org>
"I met someone who looks alot like you, she does the things you do,
but she is an IBM."  -- 'Yours Truly, 2095', *Time*, ELO
http://www.digitalsignallabs.com
Randy Yates wrote:
> Steve Underwood <steveu@dis.org> writes:
>
>> They recently changed the VAD in G.729 to try to improve it.
>
> Where is G.729 used? VOIP?
Mostly. The documentation implies it was originally designed for
wireless applications, but VoIP is its main application in the real
world. The choice of codecs is mostly based on how screwed up the patent
licencing is; without that roadblock I think most VoIP would use AMR.
Now, G.729 is well enough entrenched that you tend to need G.729 to talk
to another VoIP box.

Steve
Vladimir Vassilevsky <antispam_bogus@hotmail.com> writes:

> The key question is how fast should be the VAD response. Or if the
> processing has to be in the real time or you are processing the
> recorded speech. It is not a problem to distinguish a speech from non
> speech signals if you can have a sliding observation window of several
> seconds.
Several seconds of latency is fine for the couple of applications I have
in mind - one of which is speech recognition.

Tony
Steve Underwood <steveu@dis.org> writes:

> Or put another way: You can do a really good job in a speech
> recognition system
That's what I'm after - any references?

Tony
minfitlike@yahoo.co.uk writes:

> Using power or variance threshold is not a good idea in all but the
> simplest of cases. If the noise is stationary then you may get away
> with it but seldom is this the case, and what if the noise power is
> higher than the signal?
I agree that a good solution has to work in non-stationary noise. I
didn't advocate a variance threshold.

I'm happy that the noise power is less than the voiced parts of the
signal. One of the applications is speech recognition, and if the noise
power is greater than the vowel energies then I know from practical
experience I'll not get any useful speech recognition results.
> The hardest problem is that of separating one speaker from another
> using a VAD, e.g. for a car voice recognition system - the second
> speaker being the noise (or a radio of course).
This is indeed tricky - in fact I'm prepared to allow quiet clean voice
from a radio as part of the speech signal, as several speech recognition
applications are conversational, and then one speaker is often very much
attenuated compared with the other.
> In such circumstances it is best to use a geometric approach and
> calculate time-delays using (say) the PHAT algorithm. By working out
> the delay to say 3 microphones (two is not unique, having front-back
> ambiguity, but can be made to work) you can define a zone of activity
> in front of the desired speech.
There are occasions where I have two microphones, but I've very little
control over the hardware and it's mostly a single channel.
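For the two-microphone case, the PHAT weighting suggested above can be sketched as a whitened cross-correlation computed with FFTs. This is an illustrative single-pair delay estimate, not a production TDOA system; the block size and delay search range are assumptions.

```python
# GCC-PHAT sketch: whiten the cross-spectrum to unit magnitude so only
# phase (i.e. delay) information remains, then pick the peak of the
# inverse transform.  An illustration of the technique, not a tested
# multi-microphone localiser.
import numpy as np

def gcc_phat(x, y, max_delay):
    """Estimated delay of y relative to x, in samples."""
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = np.conj(X) * Y
    cross /= np.maximum(np.abs(cross), 1e-12)   # PHAT weighting
    cc = np.fft.irfft(cross, n=n)
    # Re-centre so negative delays appear before positive ones.
    cc = np.concatenate((cc[-max_delay:], cc[:max_delay + 1]))
    return int(np.argmax(cc)) - max_delay
```

With three microphones, pairwise delays from this estimator can be intersected to define the "zone of activity" the poster describes.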
> The speed of operation is quite another matter - the PHAT algorithm is
> a form of Generalized Cross-Correlation and uses FFTs. Quite do-able
> but maybe not fast enough for your application. As the speed of
> processors increases we will see more sophisticated VADs.
Yes, I should have mentioned that I'll have a processor quite capable of
doing FFTs.

My idea at the moment is to do autocorrelation with a couple of FFTs to
find high-powered periodic sections and then assume these are voiced
speech. Stuff that is a long way away from any voiced speech can be
assumed to be noise. I'll then build some form of statistical model
(there I go again as a speechie) and make a Viterbi boundary between
what I think is speech and noise.

So several seconds of latency, FFTs on ~50ms windows and Viterbi are in
my computational budget. This puts me a long way from the VAD algorithms
used in VOIP and the like, so I'd guess that a different class of
techniques that could use the longer latency available and greater
computational power would do better than the published techniques for
VAD for telephony.

Tony
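The autocorrelation-via-FFT periodicity test described above can be sketched per frame as follows. The window length, pitch-lag range and 0.5 threshold are illustrative assumptions; a real detector would tune these to the sample rate and combine the per-frame decisions downstream.

```python
# Periodicity test for one analysis frame: autocorrelation via a pair
# of FFTs, then look for a strong peak in a plausible pitch-lag range.
# Lag range and threshold are illustrative assumptions.
import numpy as np

def is_voiced(frame, min_lag=20, max_lag=400):
    """True if the normalised autocorrelation peaks above 0.5."""
    frame = frame - np.mean(frame)
    n = 2 * len(frame)                     # zero-pad to avoid wrap-around
    spec = np.fft.rfft(frame, n=n)
    ac = np.fft.irfft(spec * np.conj(spec), n=n)[:len(frame)]
    if ac[0] <= 0.0:
        return False                       # silent frame
    ac /= ac[0]                            # normalise so ac[0] == 1
    return float(np.max(ac[min_lag:max_lag])) > 0.5
```

A strongly periodic frame (e.g. a sustained vowel) peaks near its pitch period; broadband noise has no comparable off-zero peak.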
"Steve Underwood" <steveu@dis.org> wrote in message
news:fd1qfm$kq0$1@nnews.pacific.net.hk...

> They recently changed the VAD in G.729 to try to improve it. It still
> has problems, but I guess that is as good as anyone gets right now
> with low latency.
Steve,

Can you please provide some details on that? I know the VAD algorithm
which is typically used in the voice codecs; did they change it
substantially or is it just tweaking of the values?

VLV
How is it done?

Look for patents bearing the G10L11/02 IC or ECLA code...
;-)