For a couple of reasons my attention has been drawn to voice activity detection (speech/non-speech detection). The speechies (and I count myself in that camp) tend to use the tools that speechies know: train up a two-state HMM with mixture-Gaussian densities for each state and run a Viterbi decode to decide what is speech and what is not. However, that assumes that the noise seen in future is modelled by the noise seen in training, and I'd like to get away from that as much as I can. http://en.wikipedia.org/wiki/Voice_activity_detection gives some references, one of which is online and, I would guess, shares authorship with the Wikipedia article.

In my mind, what characterises speech over background noise is that there are voiced portions with significantly more power than the background. I'm happy with this assumption, as I always expect the speech power to be greater than the noise power during voiced speech, and I don't expect the noise to be periodic (i.e. I'm happy that it doesn't work with siren noise). So I feel that a good voice activity detector would be based on a voiced-speech detector. The boundaries of the speech would then be extended from the voiced periods by some means; even something as simple as extending by 100 ms would, I think, cover a lot of cases.

So, my questions to the comp.speech.research and comp.dsp camps are:

1) What algorithms are state of the art in voice activity detection?
2) Has the voiced-speech detection approach been published?

It all seems like simple stuff, yet what little literature there is tends to be somewhat heuristic, and I have my doubts about it working over a range of speech and noise conditions.

Tony
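[Editor's note: the "extend the boundaries from the voiced periods" step above is easy to sketch. The following is a minimal illustration in Python/NumPy; the function name, the boolean per-frame input, and the 10 ms frame size are assumptions made for the example, not anything specified in the post.]

```python
import numpy as np

def extend_voiced_mask(voiced, frame_ms=10, pad_ms=100):
    """Extend a per-frame voiced/unvoiced decision into a speech mask.

    `voiced` is a boolean array with one entry per analysis frame; every
    frame within `pad_ms` of a voiced frame is marked as speech.
    (Hypothetical helper illustrating the "extend by 100 ms" idea.)
    """
    pad = int(round(pad_ms / frame_ms))
    speech = np.zeros(len(voiced), dtype=bool)
    for i in np.flatnonzero(voiced):
        speech[max(0, i - pad):i + pad + 1] = True
    return speech

# Example: a single voiced frame at index 5, 10 ms frames, 100 ms padding,
# yields a speech region covering frames 0..15.
mask = extend_voiced_mask(np.arange(20) == 5)
```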
Voice Activity Detection (VAD)
Started by ●September 21, 2007
Reply by ●September 21, 2007
Tony Robinson wrote:

> So, my questions to the comp.speech.research and comp.dsp camps are:
>
> 1) what algorithms are state of the art in voice activity detection?
> 2) has the voiced-speech detection approach been published?
>
> It all seems like simple stuff, yet what little literature there is
> tends to be somewhat heuristic and I have my doubts about it working
> over a range of speech and noise conditions.

The key question is how fast the VAD response needs to be - whether the processing has to be done in real time, or you are processing recorded speech. It is not a problem to distinguish speech from non-speech signals if you can have a sliding observation window of several seconds; that allows enough statistics to accumulate. There is not much you can do if the voice/non-voice decision has to be made in 10 msec.

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com
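[Editor's note: Vladimir's point - that a multi-second observation window makes the decision much easier - can be illustrated with a toy long-window energy detector. This is a sketch only; the 20th-percentile noise-floor estimate, the 6 dB margin, and all names are illustrative assumptions, not anything from the thread.]

```python
import numpy as np

def frame_energy_db(x, frame_len):
    """Per-frame log energy of signal x, in frames of frame_len samples."""
    x = np.asarray(x, dtype=float)
    n = len(x) // frame_len
    frames = x[:n * frame_len].reshape(n, frame_len)
    return 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)

def vad_long_window(x, fs=8000, frame_ms=10, window_s=3.0, margin_db=6.0):
    """Speech/non-speech decision using a multi-second sliding window.

    A frame is flagged as speech when its energy exceeds the noise floor
    (a running low percentile over the last `window_s` seconds) by
    `margin_db`.  Illustrative sketch of the long-window idea only.
    """
    frame_len = int(fs * frame_ms / 1000)
    e = frame_energy_db(x, frame_len)
    w = int(window_s * 1000 / frame_ms)
    decisions = np.zeros(len(e), dtype=bool)
    for i in range(len(e)):
        lo = max(0, i - w)
        noise_floor = np.percentile(e[lo:i + 1], 20)
        decisions[i] = e[i] > noise_floor + margin_db
    return decisions
```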
Reply by ●September 21, 2007
Vladimir Vassilevsky wrote:

> Tony Robinson wrote:
>
>> So, my questions to the comp.speech.research and comp.dsp camps are:
>>
>> 1) what algorithms are state of the art in voice activity detection?
>> 2) has the voiced-speech detection approach been published?
>>
>> It all seems like simple stuff, yet what little literature there is
>> tends to be somewhat heuristic and I have my doubts about it working
>> over a range of speech and noise conditions.
>
> The key question is how fast should be the VAD response. Or if the
> processing has to be in the real time or you are processing the recorded
> speech. It is not a problem to distinguish a speech from non speech
> signals if you can have a sliding observation window of several seconds.
> That allows accumulating for the enough statistics. There is not much
> you can do if the voice/non-voice decision has to be made in 10 msec.

Or put another way: you can do a really good job in a speech recognition system, but doing it in speech codecs like G.729 is tough. They recently changed the VAD in G.729 to try to improve it. It still has problems, but I guess that is as good as anyone gets right now with low latency.

Regards,
Steve
Reply by ●September 21, 2007
Steve Underwood <steveu@dis.org> writes:

> They recently changed the VAD in G.729 to try to improve it.

Where is G.729 used? VOIP?

--
% Randy Yates           % "I met someone who looks alot like you,
%% Fuquay-Varina, NC    % she does the things you do,
%%% 919-577-9882        % but she is an IBM."
%%%% <yates@ieee.org>   % 'Yours Truly, 2095', *Time*, ELO
http://www.digitalsignallabs.com
Reply by ●September 21, 2007
Randy Yates wrote:

> Steve Underwood <steveu@dis.org> writes:
>
>> They recently changed the VAD in G.729 to try to improve it.
>
> Where is G.729 used? VOIP?

Mostly. The documentation implies it was originally designed for wireless applications, but VoIP is its main application in the real world. The choice of codecs is mostly based on how screwed up the patent licencing is; without that roadblock I think most VoIP would be AMR. As it is, G.729 is well enough entrenched that you tend to need it to talk to another VoIP box.

Steve
Reply by ●September 22, 2007
Vladimir Vassilevsky <antispam_bogus@hotmail.com> writes:

> The key question is how fast should be the VAD response. Or if the
> processing has to be in the real time or you are processing the recorded
> speech. It is not a problem to distinguish a speech from non speech
> signals if you can have a sliding observation window of several
> seconds.

Several seconds of latency is fine for the couple of applications I have in mind - one of which is speech recognition.

Tony
Reply by ●September 22, 2007
Steve Underwood <steveu@dis.org> writes:

> Or put another way: You can do a really good job in a speech recognition
> system

That's what I'm after - any references?

Tony
Reply by ●September 22, 2007
minfitlike@yahoo.co.uk writes:

> Using a power or variance threshold is not a good idea in all but the
> simplest of cases. If the noise is stationary then you may get away
> with it, but seldom is this the case - and what if the noise power is
> higher than the signal?

I agree that a good solution has to work in non-stationary noise. I didn't advocate a variance threshold. I'm happy that the noise power is less than that of the voiced parts of the signal. One of the applications is speech recognition, and if the noise power is greater than the vowel energies then I know from practical experience that I'll not get any useful speech recognition results.

> The hardest problem is that of separating one speaker from another
> using a VAD, e.g. for a car voice recognition system - the second
> speaker being the noise (or a radio, of course).

This is indeed tricky - in fact I'm prepared to allow quiet, clean voice from a radio as part of the speech signal, as several speech recognition applications are conversational, and then one speaker is often very much attenuated compared with the other.

> In such circumstances it is best to use a geometric approach and
> calculate time delays using (say) the PHAT algorithm. By working out
> the delay to, say, 3 microphones (two is not unique, having a
> front-back ambiguity, but can be made to work) you can define a zone
> of activity in front of the desired speech.

There are occasions where I have two microphones, but I've very little control over the hardware and it's mostly a single channel.

> The speed of operation is quite another matter - the PHAT algorithm is
> a form of generalized cross-correlation and uses FFTs. Quite do-able,
> but maybe not fast enough for your application. As the speed of
> processors increases we will see more sophisticated VADs.

Yes, I should have mentioned that I'll have a processor quite capable of doing FFTs.
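[Editor's note: for reference, the PHAT weighting mentioned above is simple to implement with a pair of FFTs. The sketch below is one common formulation of GCC-PHAT; the names and the small regularisation constant are the editor's own, and windowing and sub-sample interpolation are omitted.]

```python
import numpy as np

def gcc_phat(x, y, fs, max_tau=None):
    """Estimate the delay of y relative to x using GCC-PHAT.

    The cross-power spectrum is whitened (phase transform) before the
    inverse FFT, which sharpens the correlation peak under reverberation.
    Returns a positive delay (seconds) when y lags x.
    """
    n = len(x) + len(y)                      # pad to avoid circular wrap
    X = np.fft.rfft(x, n)
    Y = np.fft.rfft(y, n)
    R = np.conj(X) * Y
    R /= np.abs(R) + 1e-12                   # PHAT weighting
    cc = np.fft.irfft(R, n)
    max_shift = n // 2 if max_tau is None else int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs
```

With three microphones, a pair of such delay estimates pins down a direction of arrival, which is how the "zone of activity" in the quoted post would be defined.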
My idea at the moment is to do autocorrelation with a couple of FFTs to find high-powered periodic sections and then assume these are voiced speech. Stuff that is a long way away from any voiced speech can be assumed to be noise. I'll then build some form of statistical model (there I go again as a speechie) and make a Viterbi boundary between what I think is speech and noise.

So several seconds of latency, FFTs on ~50 ms windows and Viterbi are all within my computational budget. This puts me a long way from the VAD algorithms used in VoIP and the like, so I'd guess that a different class of techniques, able to use the longer latency and greater computational power available, would do better than the published techniques for VAD for telephony.

Tony
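[Editor's note: the "autocorrelation with a couple of FFTs" step might look something like the following sketch - flag a frame as voiced when it is both high-powered and periodic in the pitch range. All thresholds here are illustrative guesses, not tuned values, and the frame-based structure is an assumption.]

```python
import numpy as np

def voiced_frames(x, fs, frame_ms=50, f0_min=60.0, f0_max=400.0,
                  power_db=-35.0, periodicity=0.4):
    """Flag frames that are both high-powered and periodic.

    The autocorrelation is computed with two FFTs (Wiener-Khinchin);
    a frame is "voiced" when its normalised autocorrelation peak in the
    pitch range exceeds `periodicity` and its mean-square energy exceeds
    `power_db`.
    """
    frame_len = int(fs * frame_ms / 1000)
    lag_lo = int(fs / f0_max)
    lag_hi = int(fs / f0_min)
    n_frames = len(x) // frame_len
    out = np.zeros(n_frames, dtype=bool)
    for i in range(n_frames):
        frame = np.asarray(x[i * frame_len:(i + 1) * frame_len], dtype=float)
        frame = frame - frame.mean()
        spec = np.fft.rfft(frame, 2 * frame_len)   # zero-pad: linear ACF
        acf = np.fft.irfft(np.abs(spec) ** 2)[:frame_len]
        if acf[0] <= 0:
            continue
        energy_db = 10 * np.log10(acf[0] / frame_len + 1e-12)
        peak = acf[lag_lo:lag_hi].max() / acf[0]
        out[i] = energy_db > power_db and peak > periodicity
    return out
```

The boolean output of such a detector is exactly what a subsequent statistical model and Viterbi boundary search, as described above, would consume.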
Reply by ●September 23, 2007
"Steve Underwood" <steveu@dis.org> wrote in message news:fd1qfm$kq0$1@nnews.pacific.net.hk...

> They recently changed the VAD in G.729 to try to improve it. It still
> has problems, but I guess that is as good as anyone gets right now
> with low latency.

Steve,

Can you please provide some details on that? I know the VAD algorithm which is typically used in voice codecs; did they change it substantially, or is it just a tweaking of the values?

VLV
Reply by ●September 24, 2007