DSPRelated.com
Forums

Voice Activity Detection (VAD)

Started by Tony Robinson September 21, 2007
On 2007-09-21, Tony Robinson <tonyr@delThisBit.cantabResearch.com> wrote:
> For a couple of reasons my attention has been drawn to voice activity
> detection (speech/non-speech detection).
Probably as a side effect of NIST and/or the CHIL project, you'll find most
of the recent scientific papers on the subject under the name "Speech
Activity Detection" (SAD) instead of VAD - at least in an ASR environment
rather than a codec one.

  OG.
Olivier Galibert <galibert@pobox.com> writes:

> On 2007-09-21, Tony Robinson <tonyr@delThisBit.cantabResearch.com> wrote:
>> For a couple of reasons my attention has been drawn to voice activity
>> detection (speech/non-speech detection).
>
> Probably as a side effect of NIST and/or the CHIL project, you'll find
> most of the recent scientific papers on the subject under the name
> "Speech Activity Detection" (SAD) instead of VAD.  At least when in an
> ASR-environment instead of codec.
Thanks.  That's thrown up a few papers I hadn't found before, even if the
majority use some pretrained HMM-based solution.  I'll read these over the
next few days.

Digging through a huge pile of papers at the weekend I came across:

"Efficient voice activity detection algorithms using long-term speech
information" by Javier Ramírez, José C. Segura, Carmen Benítez, Ángel de la
Torre and Antonio Rubio (Speech Communication, Volume 42, Issues 3-4,
April 2004, Pages 271-287).

This is pretty much what I was after - a very simple signal processing
technique that can use the FFTs I need to do later anyway, fairly low
latency, not too many fudge factors, evaluated against G.729, adaptive
multi-rate and the advanced front end under speech recognition conditions.

I'll read the non-pretrained-HMM papers and compare.

Tony
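For anyone following along, the core long-term spectral divergence (LTSD)
decision from the Ramírez et al. paper really is only a few lines on top of
the FFT frames you already have.  Here's a rough sketch - function name,
parameter defaults and the fixed threshold are my own choices, not the
paper's, and the real method also adapts the noise estimate and threshold
over time:

```python
import numpy as np

def ltsd_vad(frames_mag, noise_psd, order=6, threshold_db=6.0):
    """Sketch of a long-term spectral divergence (LTSD) VAD.

    frames_mag: (n_frames, n_bins) magnitude spectra, e.g. |FFT| per frame
    noise_psd:  (n_bins,) estimate of the noise power spectrum
    order:      half-width M of the long-term window (2M+1 frames)
    Returns a boolean speech/non-speech decision per frame.
    """
    power = frames_mag ** 2
    n_frames, n_bins = power.shape
    decisions = np.zeros(n_frames, dtype=bool)
    for l in range(n_frames):
        lo, hi = max(0, l - order), min(n_frames, l + order + 1)
        # long-term spectral envelope: per-bin max over the window
        # (max of the power spectra equals the squared max magnitude)
        ltse = power[lo:hi].max(axis=0)
        # divergence of the envelope from the noise floor, in dB
        ltsd = 10.0 * np.log10(np.mean(ltse / noise_psd) + 1e-12)
        decisions[l] = ltsd > threshold_db
    return decisions
```

Note the max over a 2M+1 window means decisions near a speech onset lean
towards "speech" up to M frames early, which acts as a built-in hangover.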
Hi, Tony,

We have studied the LTSD-based VAD by Ramírez et al., though on the speaker
verification task.  The method works quite nicely, but there is one point
which needs caution.  The method initializes a noise spectrum estimate from
the beginning of the speech stream/file.  If this initial period is real
speech instead of noise, the method can, and usually does, fail completely.
We fixed this problem by supplying an external, manually extracted "noise
initialization file", matching the channel conditions/SNR of the evaluation
conditions.  But I guess the problem of initializing the noise model exists
for other VADs as well.

Good luck,
Tomi

Tomi Kinnunen <tkinnu@cs.joensuu.fi> writes:

> We have studied the LTSD-based VAD by Ramirez & al, though on the speaker
> verification task. The method works quite nice, but there is one point
> which needs caution. The method initializes a noise spectrum estimate from
> the beginning of the speech stream/file.
Thanks Tomi - I did see that and thought I'd have to implement another
initialisation strategy (probably through an initial buffer and picking the
low-power frames as noise).

I was also a little worried about the Aurora task as it's quite structured,
with regular spacings of words and noise - my real-life data is nothing like
that (different levels for different speakers, changing noise conditions,
music in the background, etc.).

Very good to hear that you got it going; that reduces my risk.  Thanks for
the feedback.

Tony
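That buffer-and-pick-the-quiet-frames initialisation could look something
like the sketch below - the function name and the fraction of frames kept
are arbitrary choices of mine, and in practice you'd want the buffer long
enough to be reasonably sure it contains some genuine non-speech:

```python
import numpy as np

def init_noise_from_quiet_frames(frames_mag, frac=0.2):
    """Estimate an initial noise power spectrum by averaging the
    lowest-energy frames in a startup buffer, instead of blindly
    trusting the first frames (which may already contain speech).

    frames_mag: (n_frames, n_bins) magnitude spectra of the buffer
    frac:       fraction of the quietest frames to average
    Returns a (n_bins,) noise PSD estimate.
    """
    power = frames_mag ** 2
    energy = power.sum(axis=1)              # per-frame total energy
    n_keep = max(1, int(frac * len(energy)))
    quiet = np.argsort(energy)[:n_keep]     # indices of the quietest frames
    return power[quiet].mean(axis=0)
```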
On Sep 24, 11:06 am, Tony Robinson
<to...@delThisBitk.cantabResearch.com> wrote:

> Digging through a huge pile of papers at the weekend I came across:
>
> "Efficient voice activity detection algorithms using long-term speech
> information" by Javier Ramírez, José C. Segura, Carmen Benítez, Ángel de
> la Torre and Antonio Rubio (Speech Communication, Volume 42, Issues 3-4,
> April 2004, Pages 271-287).
>
> This is pretty much what I was after - a very simple signal processing
> technique that can use the FFTs I need to do later anyway, fairly low
> latency, not too many fudge factors, evaluated against G.729, adaptive
> multi-rate and advanced front end under speech recognition conditions.
In yesterday's mail I found:

J. Ramirez, J.C. Segura, J.M. Gorriz, and L. Garcia, "Improved Voice
Activity Detection Using Contextual Multiple Hypothesis Testing for Robust
Speech Recognition," IEEE Trans. on Audio, Speech, and Language Processing,
vol. 15, no. 8 (Nov. 2007), 2177-2189.

Haven't read it yet, but the abstract indicates it presents a
generalization of the method in the above-referenced work.

cheers, jerry
Hello everyone.

I am also doing research on VAD.  I have no background in speech
processing, so I apologize if my comments sound dumb.

If your concern is that many current techniques are based on a pretrained
two-state HMM and depend on a labeled database, I recommend reading the
paper "Auto-Segmentation Based Partitioning and Clustering Approach to
Robust Endpointing" by Shi, Soong and Zhou.  I guess this one was published
in either 2006 or 2007.

This technique attempts to segment the sound signal using some homogeneity
criterion.  The boundaries of each segment are cues to the edges of speech.
Because of the nature of the segmentation, this technique is independent of
the speech features used.
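To give a feel for the homogeneity idea (this is NOT the paper's actual
criterion - just a toy stand-in of my own: a sliding pair of windows whose
normalised mean difference spikes at a change point):

```python
import numpy as np

def detect_boundaries(features, win=10, threshold=1.5):
    """Toy homogeneity-based change detection on a 1-D feature track
    (e.g. per-frame log energy).  Slides two adjacent windows and flags
    frames where their means differ sharply relative to the pooled
    spread - a crude substitute for a proper segmentation criterion.
    Returns the list of candidate boundary frame indices.
    """
    n = len(features)
    boundaries = []
    for t in range(win, n - win):
        left = features[t - win:t]
        right = features[t:t + win]
        pooled_std = np.std(np.concatenate([left, right])) + 1e-9
        score = abs(left.mean() - right.mean()) / pooled_std
        if score > threshold:
            boundaries.append(t)
    return boundaries
```

A real auto-segmentation system would use a model-based distance between
the two windows and then cluster the resulting segments into speech and
non-speech, but the sliding two-window structure is the same.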

Martin

On Oct 26, 1:48 am, martin....@gmail.com wrote:
> pretrained 2-states HMM and dependent on labeled database, I recommend
> reading this paper Auto-Segmentation Based Partitioning and Clustering
> Approach to Robust Endpointing, Shi, Soong, Zhou.
FYI, this paper appeared in ICASSP 2006.  The reference is:

http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?isnumber=34757&arnumber=1660140&count=322&index=207

Google shows that it's available at:

http://research.microsoft.com/users/yushi/publications/SpeechRelatedPapers/0100793.pdf