Reply by Jerry Wolf October 29, 2007
On Oct 26, 1:48 am, martin....@gmail.com wrote:
> pretrained 2-state HMMs and dependent on labeled databases, I recommend
> reading this paper: "Auto-Segmentation Based Partitioning and Clustering
> Approach to Robust Endpointing", Shi, Soong, Zhou.
FYI, this paper appeared in ICASSP 2006. The reference is:
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?isnumber=34757&arnumber=1660140&count=322&index=207
Google shows that it's available at:
http://research.microsoft.com/users/yushi/publications/SpeechRelatedPapers/0100793.pdf
Reply by Martin October 26, 2007
Hello everyone.

I  am also doing a research on VAD. I have no background on speech
processing so I apologize if my comments sound dumb.

If your concern is that many current techniques are based on
pretrained 2-state HMMs and dependent on labeled databases, I recommend
reading this paper: "Auto-Segmentation Based Partitioning and Clustering
Approach to Robust Endpointing", Shi, Soong, Zhou. I guess this one was
published in either 2006 or 2007.
This technique attempts to segment the sound signal using a
homogeneity criterion. The boundaries of each segment are cues to the
edges of speech. Because of the nature of the segmentation, this
technique is independent of the speech features used.
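The segmentation idea described above - place boundaries where the signal stops being statistically homogeneous - can be illustrated with a toy sketch. Everything below (the frame length, the mean-log-energy test, and both thresholds) is an invented stand-in for illustration, not the actual criterion from the Shi, Soong and Zhou paper:

```python
import math

def frame_energies(x, frame_len=160):
    """Log10 energy of each non-overlapping frame of the signal x."""
    return [math.log10(sum(s * s for s in x[i:i + frame_len]) + 1e-12)
            for i in range(0, len(x) - frame_len + 1, frame_len)]

def homogeneity_boundaries(energies, win=5, thresh=1.0):
    """Place a boundary wherever the mean log-energy of the `win` frames
    before index i differs from that of the `win` frames after it by more
    than `thresh`, suppressing boundaries closer than `win` frames apart."""
    bounds = []
    for i in range(win, len(energies) - win):
        left = sum(energies[i - win:i]) / win
        right = sum(energies[i:i + win]) / win
        if abs(left - right) > thresh:
            if not bounds or i - bounds[-1] > win:
                bounds.append(i)
    return bounds
```

On a signal that jumps from quiet to loud, the boundaries cluster around the true change point; a real segmenter would use a richer homogeneity statistic than mean energy, but the boundary-as-edge-cue logic is the same.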

Martin

Reply by Jerry Wolf October 23, 2007
On Sep 24, 11:06 am, Tony Robinson
<to...@delThisBitk.cantabResearch.com> wrote:

> Digging through a huge pile of papers at the weekend I came across:
>
> "Efficient voice activity detection algorithms using long-term speech
> information" by Javier Ramírez, José C. Segura, Carmen Benítez, Ángel de
> la Torre and Antonio Rubio (Speech Communication, Volume 42, Issues 3-4,
> April 2004, Pages 271-287).
>
> This is pretty much what I was after - a very simple signal processing
> technique that can use the FFTs I need to do later anyway, fairly low
> latency, not too many fudge factors, evaluated against G.729, adaptive
> multi-rate and advanced front end under speech recognition conditions.
In yesterday's mail I found:

J. Ramirez, J. C. Segura, J. M. Gorriz, and L. Garcia, "Improved Voice
Activity Detection Using Contextual Multiple Hypothesis Testing for Robust
Speech Recognition," IEEE Trans. on Audio, Speech, and Language Processing,
vol. 15, no. 8 (Nov. 2007), 2177-2189.

Haven't read it yet, but the abstract indicates it presents a
generalization of the method in the above-referenced work.

cheers,
jerry
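For readers without either paper to hand: the core quantity in the 2004 work is the long-term spectral divergence (LTSD), which compares a long-term spectral envelope (the per-bin maximum over several neighbouring frames) against a noise spectrum estimate and calls a frame speech when the divergence is high. A rough pure-Python sketch follows; the naive DFT, window order and 6 dB threshold are illustrative choices, not the paper's tuned values:

```python
import cmath, math

def dft_mag(frame):
    """Magnitude spectrum (bins 0..N/2) via a naive DFT; fine for short frames."""
    N = len(frame)
    return [abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N)
                    for n in range(N))) for k in range(N // 2 + 1)]

def ltsd(spectra, noise_spec, j, order=3):
    """Long-term spectral divergence at frame j, in dB: the long-term
    spectral envelope (per-bin maximum over 2*order+1 neighbouring frames)
    measured, bin by bin, against a noise magnitude estimate."""
    lo, hi = max(0, j - order), min(len(spectra), j + order + 1)
    acc = 0.0
    for k in range(len(noise_spec)):
        env = max(spectra[m][k] for m in range(lo, hi))  # long-term envelope
        acc += (env * env) / (noise_spec[k] * noise_spec[k] + 1e-12)
    return 10.0 * math.log10(acc / len(noise_spec) + 1e-12)

def ltsd_vad(frames, noise_frames, order=3, thresh_db=6.0):
    """True (speech) / False (non-speech) per frame by thresholding the LTSD."""
    spectra = [dft_mag(f) for f in frames]
    noise = [dft_mag(f) for f in noise_frames]
    noise_spec = [sum(n[k] for n in noise) / len(noise)
                  for k in range(len(noise[0]))]
    return [ltsd(spectra, noise_spec, j, order) > thresh_db
            for j in range(len(frames))]
```

Note how the multi-frame maximum builds in latency by design - exactly the property Tony said he could afford and telephony VADs usually can't.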
Reply by Tony Robinson September 25, 2007
Tomi Kinnunen <tkinnu@cs.joensuu.fi> writes:

> We have studied the LTSD-based VAD by Ramirez et al., though on the
> speaker verification task. The method works quite nicely, but there is
> one point which needs caution. The method initializes a noise spectrum
> estimate from the beginning of the speech stream/file.
Thanks Tomi - I did see that and thought I'd have to implement another
initialisation strategy (probably through an initial buffer and picking the
low-power frames as noise).

I was also a little worried about the Aurora task as it's quite structured,
with regular spacings of words and noise - my real-life data is nothing like
that (different levels for different speakers, changing noise conditions,
music in the background, etc.).

Very good to hear that you got it going; that reduces my risk. Thanks for
the feedback.

Tony
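The initialisation strategy Tony mentions - buffer some initial audio and treat the lowest-power frames as noise - is simple enough to sketch. The 20% fraction below is an arbitrary choice for illustration, not a recommended value:

```python
def init_noise_frames(frames, fraction=0.2):
    """From an initial buffer of frames, keep the lowest-energy `fraction`
    as the noise sample - safer than blindly trusting the first frames,
    which (as Tomi notes) may be real speech."""
    ranked = sorted(frames, key=lambda f: sum(s * s for s in f))
    n = max(1, int(len(ranked) * fraction))
    return ranked[:n]
```

The returned frames could then seed a noise spectrum estimate (e.g. for an LTSD-style detector) instead of an external, manually extracted noise file.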
Reply by Tomi Kinnunen September 24, 2007
Hi, Tony,

We have studied the LTSD-based VAD by Ramirez et al., though on the speaker 
verification task. The method works quite nicely, but there is one point 
which needs caution. The method initializes a noise spectrum estimate from 
the beginning of the speech stream/file. If this initial period is real 
speech instead of noise, the method can, and usually does, fail 
completely. We fixed this problem by giving an external, manually 
extracted "noise initialization file", matching the evaluation conditions 
in channel characteristics and SNR. But I guess the problem of 
initializing the noise model exists for other VADs as well.

Good luck,
Tomi

In comp.speech.research Tony Robinson <tonyr@delthisbitk.cantabresearch.com> wrote:
: 
: Olivier Galibert <galibert@pobox.com> writes:
: 
:> On 2007-09-21, Tony Robinson <tonyr@delThisBit.cantabResearch.com> wrote:
:>> For a couple of reasons my attention has been drawn to voice activity
:>> detection (speech/non-speech detection).
:>
:> Probably as a side effect of NIST and/or the CHIL project, you'll find
:> most of the recent scientific papers on the subject under the name
:> "Speech Activity Detection" (SAD) instead of VAD.  At least when in an
:> ASR-environment instead of codec.
: 
: Thanks.   That's thrown up a few papers I hadn't found before, even if
: the majority use some pretrained HMM-based solution.   I'll read these
: over the next few days.
: 
: Digging through a huge pile of papers at the weekend I came across:
: 
: "Efficient voice activity detection algorithms using long-term speech
: information" by Javier Ramírez, José C. Segura, Carmen Benítez, Ángel de
: la Torre and Antonio Rubio (Speech Communication, Volume 42, Issues 3-4,
: April 2004, Pages 271-287).
: 
: This is pretty much what I was after - a very simple signal processing
: technique that can use the FFTs I need to do later anyway, fairly low
: latency, not too many fudge factors, evaluated against G.729, adaptive
: multi-rate and advanced front end under speech recognition conditions.
: 
: I'll read the non-pretrained-HMM papers and compare.
: 
: 
: Tony
: 

Reply by Tony Robinson September 24, 2007
Olivier Galibert <galibert@pobox.com> writes:

> On 2007-09-21, Tony Robinson <tonyr@delThisBit.cantabResearch.com> wrote:
>> For a couple of reasons my attention has been drawn to voice activity
>> detection (speech/non-speech detection).
>
> Probably as a side effect of NIST and/or the CHIL project, you'll find
> most of the recent scientific papers on the subject under the name
> "Speech Activity Detection" (SAD) instead of VAD. At least when in an
> ASR environment rather than a codec one.
Thanks. That's thrown up a few papers I hadn't found before, even if the
majority use some pretrained HMM-based solution. I'll read these over the
next few days.

Digging through a huge pile of papers at the weekend I came across:

"Efficient voice activity detection algorithms using long-term speech
information" by Javier Ramírez, José C. Segura, Carmen Benítez, Ángel de
la Torre and Antonio Rubio (Speech Communication, Volume 42, Issues 3-4,
April 2004, Pages 271-287).

This is pretty much what I was after - a very simple signal processing
technique that can use the FFTs I need to do later anyway, fairly low
latency, not too many fudge factors, evaluated against G.729, adaptive
multi-rate and advanced front end under speech recognition conditions.

I'll read the non-pretrained-HMM papers and compare.

Tony
Reply by Olivier Galibert September 24, 2007
On 2007-09-21, Tony Robinson <tonyr@delThisBit.cantabResearch.com> wrote:
> For a couple of reasons my attention has been drawn to voice activity
> detection (speech/non-speech detection).
Probably as a side effect of NIST and/or the CHIL project, you'll find
most of the recent scientific papers on the subject under the name
"Speech Activity Detection" (SAD) instead of VAD. At least when in an
ASR environment rather than a codec one.

OG.
Reply by Regis September 24, 2007
How is it done?

Look for patents bearing the G10L11/02 IC or ECLA code...
;-)

Reply by Vladimir Vassilevsky September 23, 2007
"Steve Underwood" <steveu@dis.org> wrote in message
news:fd1qfm$kq0$1@nnews.pacific.net.hk...

> They recently changed the VAD in G.729 to try to improve it. It still
> has problems, but I guess that is as good as anyone gets right now with
> low latency.
Steve,

Can you please provide some details on that? I know the VAD algorithm
which is typically used in voice codecs; did they change it substantially,
or is it just a tweaking of the values?

VLV
Reply by Tony Robinson September 22, 2007
minfitlike@yahoo.co.uk writes:

> Using a power or variance threshold is not a good idea in all but the
> simplest of cases. If the noise is stationary then you may get away with
> it, but seldom is this the case - and what if the noise power is higher
> than the signal?
I agree that a good solution has to work in non-stationary noise, and I
didn't advocate a variance threshold. I'm happy to assume that the noise
power is less than that of the voiced parts of the signal: one of the
applications is speech recognition, and if the noise power is greater than
the vowel energies then I know from practical experience that I'll not get
any useful speech recognition results.
> The hardest problem is that of separating one speaker from another using
> a VAD, e.g. for a car voice recognition system - the second speaker
> being the noise (or a radio, of course).
This is indeed tricky - in fact I'm prepared to allow quiet, clean voice
from a radio as part of the speech signal, as several speech recognition
applications are conversational, and then one speaker is often very much
attenuated compared with the other.
> In such circumstances it is best to use a geometric approach and
> calculate time delays using (say) the PHAT algorithm. By working out the
> delay to, say, 3 microphones (two is not unique, having a front-back
> ambiguity, but can be made to work) you can define a zone of activity
> in front of the desired speech.
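The PHAT estimator mentioned above is standard generalized cross-correlation: whiten the cross-power spectrum to unit magnitude so that only phase (i.e. delay) information survives, transform back, and take the lag of the sharpest peak. A minimal single-microphone-pair sketch (naive DFTs for readability; a real implementation would use an FFT library):

```python
import cmath, math

def dft(x):
    """Naive DFT - O(N^2), fine for a short illustration."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Naive inverse DFT."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def gcc_phat_delay(x, y):
    """Estimate the delay of y relative to x, in samples, with GCC-PHAT:
    normalize each cross-spectrum bin to unit magnitude (keeping only the
    phase), transform back, and take the lag of the peak."""
    X, Y = dft(x), dft(y)
    phat = [(a.conjugate() * b) / (abs(a.conjugate() * b) + 1e-12)
            for a, b in zip(X, Y)]
    r = [v.real for v in idft(phat)]
    N = len(r)
    peak = max(range(N), key=lambda i: r[i])
    return peak if peak <= N // 2 else peak - N  # fold wrap-around into negative lags
```

With three microphones you would run this on each pair and intersect the resulting delay constraints to get the "zone of activity" the poster describes.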
There are occasions where I have two microphones, but I've very little
control over the hardware, and it's mostly a single channel.
> The speed of operation is quite another matter - the PHAT algorithm is a
> form of generalized cross-correlation and uses FFTs. Quite do-able, but
> maybe not fast enough for your application. As the speed of processors
> increases we will see more sophisticated VADs.
Yes, I should have mentioned that I'll have a processor quite capable of
doing FFTs.

My idea at the moment is to do autocorrelation with a couple of FFTs to
find high-powered periodic sections and then assume these are voiced
speech. Stuff that is a long way away from any voiced speech can be
assumed to be noise. I'll then build some form of statistical model (there
I go again as a speechie) and make a Viterbi boundary between what I think
is speech and what I think is noise.

So several seconds of latency, FFTs on ~50 ms windows and Viterbi are all
in my computational budget. This puts me a long way from the VAD
algorithms used in VoIP and the like, so I'd guess that a different class
of techniques, able to use the longer latency available and the greater
computational power, would do better than the published VAD techniques for
telephony.

Tony
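Tony's first step - flag high-powered periodic sections as voiced - might look something like the following. Both thresholds and the lag range (roughly 50-400 Hz pitch at an assumed 8 kHz sample rate) are illustrative guesses, and the autocorrelation is done directly in the time domain for clarity rather than with the two FFTs he mentions:

```python
import math

def voiced_frame(frame, min_lag=20, max_lag=160,
                 power_floor=1e-3, periodicity=0.5):
    """Flag a frame as voiced if it is both energetic and periodic: mean
    power above `power_floor`, and the normalized autocorrelation peak
    over plausible pitch lags above `periodicity`."""
    n = len(frame)
    r0 = sum(s * s for s in frame)   # energy = autocorrelation at lag 0
    if r0 / n < power_floor:
        return False                 # too quiet to be voiced speech
    best = 0.0
    for lag in range(min_lag, min(max_lag, n - 1) + 1):
        r = sum(frame[i] * frame[i + lag] for i in range(n - lag))
        best = max(best, r / r0)
    return best > periodicity
```

Frames flagged this way would anchor the speech class, frames far from any voiced frame would anchor the noise class, and the statistical model plus Viterbi pass would then place the boundary between them.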