> pretrained 2-states HMM and dependent on labeled database, I recommend
> reading this paper Auto-Segmentation Based Partitioning and Clustering
> Approach to Robust Endpointing, Shi, Soong, Zhou.
Hello everyone.
I am also doing research on VAD. I have no background in speech
processing, so I apologize if my comments sound dumb.
If your concern is that many current techniques are based on
pretrained 2-state HMMs and depend on a labeled database, I recommend
reading the paper "Auto-Segmentation Based Partitioning and Clustering
Approach to Robust Endpointing" by Shi, Soong, and Zhou. I think it was
published in either 2006 or 2007.
This technique attempts to segment the sound signal using a
homogeneity criterion. The boundaries of each segment serve as cues to
the edges of the speech. Because of the nature of the segmentation, this
technique is independent of any particular speech feature.
Martin
Reply by Jerry Wolf●October 23, 2007
On Sep 24, 11:06 am, Tony Robinson
<to...@delThisBitk.cantabResearch.com> wrote:
> Digging through a huge pile of papers at the weekend I came across:
>
> "Efficient voice activity detection algorithms using long-term speech
> information" by Javier Ramírez, José C. Segura, Carmen Benítez, Ángel de
> la Torre and Antonio Rubio (Speech Communication, Volume 42, Issues 3-4,
> April 2004, Pages 271-287).
>
> This is pretty much what I was after - a very simple signal processing
> technique that can use the FFTs I need to do later anyway, fairly low
> latency, not too many fudge factors, evaluated against G.729, adaptive
> multi-rate and advanced front end under speech recognition conditions.
In yesterday's mail I found: J. Ramirez, J.C. Segura, J.M. Gorriz, and
L. Garcia, "Improved Voice Activity Detection Using Contextual
Multiple Hypothesis Testing for Robust Speech Recognition," IEEE
Trans. on Audio, Speech, and Language Processing, vol. 15, no. 8 (Nov.
2007), 2177-2189. Haven't read it yet, but the abstract indicates it
presents a generalization of the method in the above-referenced work.
cheers,
jerry
Reply by Tony Robinson●September 25, 2007
Tomi Kinnunen <tkinnu@cs.joensuu.fi> writes:
> We have studied the LTSD-based VAD by Ramirez et al., though on the speaker
> verification task. The method works quite nicely, but there is one point
> which needs caution. The method initializes a noise spectrum estimate from
> the beginning of the speech stream/file.
Thanks Tomi - I did see that and thought I'd have to implement another
initialisation strategy (probably through an initial buffer and picking
the low power frames as noise). I was also a little worried about the
Aurora task as it's quite structured with regular spacings of words and
noise - my real life data is nothing like that (different levels for
different speakers, changing noise conditions, music in background,
etc). Very good to hear that you got it going, that reduces my risk.
Thanks for the feedback.
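One possible reading of the initialisation strategy mentioned above (fill an initial buffer, then treat its lowest-power frames as noise) is sketched here in Python. It is only an illustration under assumptions: the function names, the non-overlapping framing and the 20% `fraction` are hypothetical choices, not taken from the LTSD paper.

```python
def frame_powers(samples, frame_len):
    """Mean power of each non-overlapping frame in the buffer."""
    n_frames = len(samples) // frame_len
    return [
        sum(x * x for x in samples[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


def pick_noise_frames(samples, frame_len, fraction=0.2):
    """Indices of the lowest-power frames, which are assumed to be noise.

    `fraction` is an assumed tuning knob: the share of frames in the
    initial buffer that is trusted to contain no speech.
    """
    powers = frame_powers(samples, frame_len)
    n_keep = max(1, int(len(powers) * fraction))
    # Rank frames by power and keep the quietest ones.
    quietest_first = sorted(range(len(powers)), key=lambda i: powers[i])
    return sorted(quietest_first[:n_keep])
```

The spectra of the selected frames could then be averaged to seed the noise estimate, instead of relying on the first frames of the file being speech-free.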
Tony
Reply by Tomi Kinnunen●September 24, 2007
Hi, Tony,
We have studied the LTSD-based VAD by Ramirez et al., though on the speaker
verification task. The method works quite nicely, but there is one point
which needs caution. The method initializes a noise spectrum estimate from
the beginning of the speech stream/file. If this initial period is real
speech instead of noise, the method can, and usually does, fail
completely. We fixed this problem by supplying an external, manually
extracted "noise initialization file" that matches the evaluation
conditions in channel and SNR. But I guess the problem of initializing
the noise model exists for other VADs as well.
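To make the initialization issue concrete, here is a rough Python sketch of a long-term spectral divergence measure as it is commonly described: a per-bin maximum of the magnitude spectrum over neighbouring frames, compared in dB against a noise spectrum. Treat it as an approximation under assumptions rather than the paper's exact algorithm: the function names, the one-sided averaging, the `floor` guard and the naive DFT (a stand-in for a real FFT) are all choices made for this illustration, and the paper's noise update and decision threshold are omitted.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """One-sided DFT magnitudes of a frame.

    A naive O(n^2) DFT keeps the sketch dependency-free; a real
    implementation would use an FFT routine instead.
    """
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def initial_noise(spectra, n_init):
    """Average the first n_init magnitude spectra as the noise estimate.

    This is the fragile step: if those frames contain speech, the
    "noise" estimate is really a speech spectrum.
    """
    n_bins = len(spectra[0])
    return [sum(s[k] for s in spectra[:n_init]) / n_init for k in range(n_bins)]


def ltsd(spectra, noise, centre, order, floor=1e-12):
    """Long-term spectral divergence in dB at frame index `centre`."""
    lo = max(0, centre - order)
    hi = min(len(spectra), centre + order + 1)
    total = 0.0
    for k in range(len(noise)):
        # Long-term spectral envelope: per-bin maximum over nearby frames.
        envelope = max(spectra[j][k] for j in range(lo, hi))
        total += envelope ** 2 / max(noise[k] ** 2, floor)
    return 10.0 * math.log10(max(total / len(noise), floor))
```

When the noise estimate comes from genuine noise frames, speech frames score far above noise frames. When it comes from speech frames, the speech itself scores only a few dB, which is consistent with the complete failure described above.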
Good luck,
Tomi
In comp.speech.research Tony Robinson <tonyr@delthisbitk.cantabresearch.com> wrote:
:
: Olivier Galibert <galibert@pobox.com> writes:
:
:> On 2007-09-21, Tony Robinson <tonyr@delThisBit.cantabResearch.com> wrote:
:>> For a couple of reasons my attention has been drawn to voice activity
:>> detection (speech/non-speech detection).
:>
:> Probably as a side effect of NIST and/or the CHIL project, you'll find
:> most of the recent scientific papers on the subject under the name
:> "Speech Activity Detection" (SAD) instead of VAD. At least when in an
:> ASR-environment instead of codec.
:
: Thanks. That's thrown up a few papers I hadn't found before, even if
: the majority use some pretrained HMM based solution. I'll read these
: over the next few days.
:
: Digging through a huge pile of papers at the weekend I came across:
:
: "Efficient voice activity detection algorithms using long-term speech
: information" by Javier Ramírez, José C. Segura, Carmen Benítez, Ángel de
: la Torre and Antonio Rubio (Speech Communication, Volume 42, Issues 3-4,
: April 2004, Pages 271-287).
:
: This is pretty much what I was after - a very simple signal processing
: technique that can use the FFTs I need to do later anyway, fairly low
: latency, not too many fudge factors, evaluated against G.729, adaptive
: multi-rate and advanced front end under speech recognition conditions.
:
: I'll read the non-pretrained-HMM papers and compare.
:
:
: Tony
:
--
Reply by Tony Robinson●September 24, 2007
Olivier Galibert <galibert@pobox.com> writes:
> On 2007-09-21, Tony Robinson <tonyr@delThisBit.cantabResearch.com> wrote:
>> For a couple of reasons my attention has been drawn to voice activity
>> detection (speech/non-speech detection).
>
> Probably as a side effect of NIST and/or the CHIL project, you'll find
> most of the recent scientific papers on the subject under the name
> "Speech Activity Detection" (SAD) instead of VAD. At least when in an
> ASR-environment instead of codec.
Thanks. That's thrown up a few papers I hadn't found before, even if
the majority use some pretrained HMM based solution. I'll read these
over the next few days.
Digging through a huge pile of papers at the weekend I came across:
"Efficient voice activity detection algorithms using long-term speech
information" by Javier Ramírez, José C. Segura, Carmen Benítez, Ángel de
la Torre and Antonio Rubio (Speech Communication, Volume 42, Issues 3-4,
April 2004, Pages 271-287).
This is pretty much what I was after - a very simple signal processing
technique that can use the FFTs I need to do later anyway, fairly low
latency, not too many fudge factors, evaluated against G.729, adaptive
multi-rate and advanced front end under speech recognition conditions.
I'll read the non-pretrained-HMM papers and compare.
Tony
Reply by Olivier Galibert●September 24, 2007
On 2007-09-21, Tony Robinson <tonyr@delThisBit.cantabResearch.com> wrote:
> For a couple of reasons my attention has been drawn to voice activity
> detection (speech/non-speech detection).
Probably as a side effect of NIST and/or the CHIL project, you'll find
most of the recent scientific papers on the subject under the name
"Speech Activity Detection" (SAD) instead of VAD. At least when in an
ASR-environment instead of codec.
OG.
Reply by Regis●September 24, 2007
How is it done?
Look for patents bearing the G10L11/02 IC or ECLA code...
;-)
Reply by Vladimir Vassilevsky●September 23, 2007
"Steve Underwood" <steveu@dis.org> wrote in message
news:fd1qfm$kq0$1@nnews.pacific.net.hk...
> They recently
> changed the VAD in G.729 to try to improve it. It still has problems,
> but I guess that is as good as anyone gets right now with low latency.
Steve,
Can you please provide some details on that?
I know the VAD algorithm which is typically used in the voice codecs; did
they change it substantially or is it just tweaking of the values?
VLV
Reply by Tony Robinson●September 22, 2007
minfitlike@yahoo.co.uk writes:
> Using power or variance threshold is not a good idea in all but the
> simplest of cases. If the noise is stationary then you may get away
> with it but seldom is this the case and what if the noise power is
> higher than the signal?
I agree that a good solution has to work in non-stationary noise. I
didn't advocate a variance threshold. I'm happy that the noise power
is less than the voiced parts of the signal. One of the applications is
speech recognition, and if the noise power is greater than the vowel
energies then I know from practical experience I'll not get any useful
speech recognition results.
> The hardest problem is that of separating one
> speaker from another using a VAD eg for a car voice recognition system
> - the second speaker being the noise (or a radio of course).
This is indeed tricky - in fact I'm prepared to allow quiet clean voice
from a radio as part of the speech signal as several speech recognition
applications are conversational, and then one speaker is often very much
attenuated compared with the other.
> In such circumstances it is best to use a geometric approach and
> calculate time-delays using (say) the PHAT algorithm. By working out
> the delay to say 3 microphones (two is not unique having front-back
> ambiguity but can be made to work) you can define a zone of activity
> in front of the desired speech.
There are occasions where I have two microphones, but I've very little
control over the hardware and it's mostly a single channel.
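For readers unfamiliar with the PHAT approach mentioned in the quoted text, a minimal Python sketch of delay estimation between two microphone signals follows. It is an illustration under assumptions: `dft` and `gcc_phat_delay` are hypothetical names, the naive DFT stands in for a real FFT, and the result is in whole samples with no interpolation.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) DFT / inverse DFT (a stand-in for a real FFT)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(v * cmath.exp(sign * 2j * cmath.pi * k * t / n)
            for t, v in enumerate(x))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def gcc_phat_delay(sig, ref, max_lag):
    """Delay of `sig` relative to `ref` in samples (positive: sig is later)."""
    n = len(sig) + len(ref)  # zero-pad so circular correlation acts as linear
    a = dft(list(sig) + [0.0] * (n - len(sig)))
    b = dft(list(ref) + [0.0] * (n - len(ref)))
    cross = [x * y.conjugate() for x, y in zip(a, b)]
    # PHAT weighting: discard magnitude and keep only the phase.
    whitened = [c / max(abs(c), 1e-12) for c in cross]
    corr = dft(whitened, inverse=True)
    # Negative lags wrap around to the end of the circular correlation.
    return max(range(-max_lag, max_lag + 1), key=lambda lag: corr[lag % n].real)
```

With delays from several microphone pairs, a zone of activity can be defined geometrically as the quoted text suggests. With a single channel, as here, the method does not apply.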
> The speed of operation is quite another matter - the PHAT algorithm is
> a form of Generalized Cross-correlation and uses
> FFTs. Quite do-able but maybe not fast enough for your application. As
> the speed of processors increases we will see more sophisticated VADs.
Yes, I should have mentioned that I'll have a processor quite capable of
doing FFTs. My idea at the moment is to do autocorrelation with a
couple of FFTs to find high powered periodic sections and then assume
these are voiced speech. Stuff that is a long way away from any voiced
speech can be assumed to be noise. I'll then build some form of
statistical model (there I go again as a speechie) and make a Viterbi
boundary between what I think is speech and noise. So several seconds
of latency, FFTs on ~50ms windows and Viterbi are in my computational
budget.
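The autocorrelation step described above (two transforms per frame, then a search for a strong peak at a plausible pitch lag) could look roughly like this in Python. It is a sketch under assumptions: the function names and the lag-range parameters are hypothetical, the naive DFT stands in for a real FFT, and the score is only a crude periodicity measure, not a finished voicing detector.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) DFT / inverse DFT (a stand-in for a real FFT)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(v * cmath.exp(sign * 2j * cmath.pi * k * t / n)
            for t, v in enumerate(x))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def autocorrelation(frame):
    """Linear autocorrelation via the power spectrum (Wiener-Khinchin)."""
    padded = list(frame) + [0.0] * len(frame)  # zero-pad to avoid wrap-around
    power = [abs(v) ** 2 for v in dft(padded)]
    acf = dft(power, inverse=True)
    return [v.real for v in acf[:len(frame)]]


def periodicity(frame, min_lag, max_lag):
    """Largest autocorrelation peak in the lag range, normalised by lag 0.

    Values near 1 suggest a strongly periodic (possibly voiced) frame;
    an all-zero frame returns 0.0.
    """
    acf = autocorrelation(frame)
    if acf[0] <= 0.0:
        return 0.0
    return max(acf[min_lag:max_lag + 1]) / acf[0]
```

Frames that combine high power (`acf[0]`) with a high periodicity score would be the candidates for voiced speech. The lag range would be chosen from the expected pitch range and the sample rate.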
This puts me a long way from the VAD algorithms used in VOIP and the
like, so I'd guess that a different class of techniques that could use
the longer latency available and greater computational power would do
better than the published techniques for VAD for telephony.
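With several seconds of latency available, a whole buffer of frames can be decoded at once. As a rough illustration of the Viterbi boundary idea mentioned above, here is a two-state (noise/speech) decoder in Python. It is a sketch under assumptions: the per-frame log-likelihoods are taken as given by some statistical model, and `switch_penalty` is a hypothetical single parameter standing in for real transition probabilities.

```python
def viterbi_speech_noise(speech_ll, noise_ll, switch_penalty):
    """Best noise(0)/speech(1) label sequence for per-frame log-likelihoods.

    `switch_penalty` is subtracted from the path score at every change of
    state, which discourages isolated one-frame flips.
    """
    if not speech_ll:
        return []
    emit = list(zip(noise_ll, speech_ll))
    score = [emit[0][0], emit[0][1]]
    back = []
    for t in range(1, len(emit)):
        new_score, pointers = [], []
        for state in (0, 1):
            stay = score[state]
            switch = score[1 - state] - switch_penalty
            if stay >= switch:
                pointers.append(state)
                best = stay
            else:
                pointers.append(1 - state)
                best = switch
            new_score.append(best + emit[t][state])
        back.append(pointers)
        score = new_score
    # Trace back from the better final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

For example, a single frame that slightly favours noise in the middle of a speech run stays labelled as speech when the penalty outweighs the per-frame gain. With a penalty of zero the result reduces to a frame-by-frame decision.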
Tony