DSPRelated.com
Forums

Voice Activity Detection (VAD)

Started by Tony Robinson September 21, 2007
On 2007-09-21, Tony Robinson <tonyr@delThisBit.cantabResearch.com> wrote:
> For a couple of reasons my attention has been drawn to voice activity
> detection (speech/non-speech detection).
Probably as a side effect of NIST and/or the CHIL project, you'll find most
of the recent scientific papers on the subject under the name "Speech
Activity Detection" (SAD) instead of VAD - at least in an ASR environment
rather than a codec one.

  OG.
Olivier Galibert <galibert@pobox.com> writes:

> On 2007-09-21, Tony Robinson <tonyr@delThisBit.cantabResearch.com> wrote:
>> For a couple of reasons my attention has been drawn to voice activity
>> detection (speech/non-speech detection).
>
> Probably as a side effect of NIST and/or the CHIL project, you'll find
> most of the recent scientific papers on the subject under the name
> "Speech Activity Detection" (SAD) instead of VAD.  At least when in an
> ASR-environment instead of codec.
Thanks.  That's thrown up a few papers I hadn't found before, even if the
majority use some pretrained HMM-based solution.  I'll read these over the
next few days.

Digging through a huge pile of papers at the weekend I came across:

"Efficient voice activity detection algorithms using long-term speech
information" by Javier Ramírez, José C. Segura, Carmen Benítez, Ángel de la
Torre and Antonio Rubio (Speech Communication, Volume 42, Issues 3-4,
April 2004, Pages 271-287).

This is pretty much what I was after - a very simple signal processing
technique that can use the FFTs I need to do later anyway, fairly low
latency, not too many fudge factors, evaluated against G.729, adaptive
multi-rate and the advanced front end under speech recognition conditions.

I'll read the non-pretrained-HMM papers and compare.

Tony
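For anyone following along, the core long-term spectral divergence (LTSD)
decision from the Ramírez et al. paper really is only a few lines on top of
the FFT frames you already have.  Here's a rough sketch - function name,
parameter defaults and the fixed threshold are my own choices, not the
paper's, and the real method also adapts the noise estimate and threshold
over time:

```python
import numpy as np

def ltsd_vad(frames_mag, noise_psd, order=6, threshold_db=6.0):
    """Sketch of a long-term spectral divergence (LTSD) VAD.

    frames_mag: (n_frames, n_bins) magnitude spectra, e.g. |FFT| per frame
    noise_psd:  (n_bins,) estimate of the noise power spectrum
    order:      half-width M of the long-term window (2M+1 frames)
    Returns a boolean speech/non-speech decision per frame.
    """
    power = frames_mag ** 2
    n_frames, n_bins = power.shape
    decisions = np.zeros(n_frames, dtype=bool)
    for l in range(n_frames):
        lo, hi = max(0, l - order), min(n_frames, l + order + 1)
        # long-term spectral envelope: per-bin max over the window
        # (max of the power spectra equals the squared max magnitude)
        ltse = power[lo:hi].max(axis=0)
        # divergence of the envelope from the noise floor, in dB
        ltsd = 10.0 * np.log10(np.mean(ltse / noise_psd) + 1e-12)
        decisions[l] = ltsd > threshold_db
    return decisions
```

Note the max over a 2M+1 window means decisions near a speech onset lean
towards "speech" up to M frames early, which acts as a built-in hangover.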
Hi, Tony,

We have studied the LTSD-based VAD by Ramírez et al., though on the speaker
verification task.  The method works quite nicely, but there is one point
which needs caution.  The method initializes a noise spectrum estimate from
the beginning of the speech stream/file.  If this initial period is real
speech instead of noise, the method can, and usually does, fail completely.
We fixed this problem by supplying an external, manually extracted "noise
initialization file", matching the channel conditions/SNR of the evaluation
conditions.  But I guess the problem of initializing the noise model exists
for other VADs as well.

Good luck,
Tomi

Tomi Kinnunen <tkinnu@cs.joensuu.fi> writes:

> We have studied the LTSD-based VAD by Ramirez & al, though on the speaker
> verification task. The method works quite nice, but there is one point
> which needs caution. The method initializes a noise spectrum estimate from
> the beginning of the speech stream/file.
Thanks Tomi - I did see that and thought I'd have to implement another
initialisation strategy (probably through an initial buffer and picking the
low-power frames as noise).

I was also a little worried about the Aurora task as it's quite structured,
with regular spacings of words and noise - my real-life data is nothing like
that (different levels for different speakers, changing noise conditions,
music in the background, etc.).

Very good to hear that you got it going; that reduces my risk.  Thanks for
the feedback.

Tony
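That buffer-and-pick-the-quiet-frames initialisation could look something
like the sketch below - the function name and the fraction of frames kept
are arbitrary choices of mine, and in practice you'd want the buffer long
enough to be reasonably sure it contains some genuine non-speech:

```python
import numpy as np

def init_noise_from_quiet_frames(frames_mag, frac=0.2):
    """Estimate an initial noise power spectrum by averaging the
    lowest-energy frames in a startup buffer, instead of blindly
    trusting the first frames (which may already contain speech).

    frames_mag: (n_frames, n_bins) magnitude spectra of the buffer
    frac:       fraction of the quietest frames to average
    Returns a (n_bins,) noise PSD estimate.
    """
    power = frames_mag ** 2
    energy = power.sum(axis=1)              # per-frame total energy
    n_keep = max(1, int(frac * len(energy)))
    quiet = np.argsort(energy)[:n_keep]     # indices of the quietest frames
    return power[quiet].mean(axis=0)
```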
On Sep 24, 11:06 am, Tony Robinson
<to...@delThisBitk.cantabResearch.com> wrote:

> Digging through a huge pile of papers at the weekend I came across:
>
> "Efficient voice activity detection algorithms using long-term speech
> information" by Javier Ramírez, José C. Segura, Carmen Benítez, Ángel de
> la Torre and Antonio Rubio (Speech Communication, Volume 42, Issues 3-4,
> April 2004, Pages 271-287).
>
> This is pretty much what I was after - a very simple signal processing
> technique that can use the FFTs I need to do later anyway, fairly low
> latency, not too many fudge factors, evaluated against G.729, adaptive
> multi-rate and advanced front end under speech recognition conditions.
In yesterday's mail I found:

J. Ramirez, J.C. Segura, J.M. Gorriz, and L. Garcia, "Improved Voice
Activity Detection Using Contextual Multiple Hypothesis Testing for Robust
Speech Recognition," IEEE Trans. on Audio, Speech, and Language Processing,
vol. 15, no. 8 (Nov. 2007), 2177-2189.

Haven't read it yet, but the abstract indicates it presents a
generalization of the method in the above-referenced work.

cheers, jerry
Hello everyone.

I am also doing research on VAD.  I have no background in speech
processing, so I apologize if my comments sound dumb.

If your concern is that many current techniques are based on a pretrained
two-state HMM and depend on a labeled database, I recommend reading the
paper "Auto-Segmentation Based Partitioning and Clustering Approach to
Robust Endpointing" by Shi, Soong and Zhou.  I guess this one was published
in either 2006 or 2007.

This technique attempts to segment the sound signal using some homogeneity
criterion.  The boundaries of each segment are cues to the edges of speech.
Because of the nature of the segmentation, this technique is independent of
the speech features used.
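To give a feel for the homogeneity idea (this is NOT the paper's actual
criterion - just a toy stand-in of my own: a sliding pair of windows whose
normalised mean difference spikes at a change point):

```python
import numpy as np

def detect_boundaries(features, win=10, threshold=1.5):
    """Toy homogeneity-based change detection on a 1-D feature track
    (e.g. per-frame log energy).  Slides two adjacent windows and flags
    frames where their means differ sharply relative to the pooled
    spread - a crude substitute for a proper segmentation criterion.
    Returns the list of candidate boundary frame indices.
    """
    n = len(features)
    boundaries = []
    for t in range(win, n - win):
        left = features[t - win:t]
        right = features[t:t + win]
        pooled_std = np.std(np.concatenate([left, right])) + 1e-9
        score = abs(left.mean() - right.mean()) / pooled_std
        if score > threshold:
            boundaries.append(t)
    return boundaries
```

A real auto-segmentation system would use a model-based distance between
the two windows and then cluster the resulting segments into speech and
non-speech, but the sliding two-window structure is the same.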

Martin

On Oct 26, 1:48 am, martin....@gmail.com wrote:
> pretrained 2-states HMM and dependent on labeled database, I recommend
> reading this paper Auto-Segmentation Based Partitioning and Clustering
> Approach to Robust Endpointing, Shi, Soong, Zhou.
FYI, this paper appeared in ICASSP 2006.  The reference is:

http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?isnumber=34757&arnumber=1660140&count=322&index=207

Google shows that it's available at:

http://research.microsoft.com/users/yushi/publications/SpeechRelatedPapers/0100793.pdf