Hi all,

I am currently working on speech detection in white Gaussian noise. I take the FFT of a frame of noisy speech, and if the maximum value is above a certain threshold, I declare that speech is detected. Of course, another part of the algorithm does a more thorough check.

My question: I need to come up with a formula for this threshold, meaning something like threshold = x^2 + y rather than threshold = 0.7. How should I do it?

I have an idea, but I don't know if it can work. For example:

if energy of frame = 0.1, my threshold should be 0.5
if energy of frame = 0.2, my threshold should be, say, 0.6
if energy of frame = 0.3, my threshold should be, say, 0.63

Then I plot these points on a Cartesian plot and get the equation of the curve; that will be the formula for the threshold. Will this solve my problem? The curve will definitely not be linear. If so, how can I find the equation of the curve? Any advice?

Thank you
Speech detection (Formulation Problem)
Started by ●June 22, 2006
Reply by ●June 23, 2006
"doggie" <elusivetruelove2003@yahoo.com> wrote in message news:xpydnSwMG7dZhAfZnZ2dnUVZ_tmdnZ2d@giganews.com...
> Hi all, I am currently working on speech detection in white Gaussian noise. I
> take the FFT of a frame of noisy speech, and if the maximum value is
> above a certain threshold, I declare that speech is detected. Of course,
> another part of the algorithm does a more thorough check.
>
> My question: I need to come up with a formula for this threshold,
> meaning something like threshold = x^2 + y rather than threshold = 0.7. How
> should I do it?

Energy methods are not too reliable. Also, if the SNR is negative, they nearly always fail.

M.P
Reply by ●June 23, 2006
> Energy methods are not too reliable. Also, if the SNR is negative, they nearly
> always fail.
>
> M.P

Hi, what you said is true. But I'm just a beginner trying to make something simple, and I'm stuck on the formulation problem, which I hope curve fitting and finding the equation of the curve can help with. Do you think so? Thanks
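As a rough illustration of the curve-fitting idea, here is a minimal Python sketch that fits a polynomial to the three example (energy, threshold) pairs from the original post. NumPy, the quadratic degree, and the name `threshold_for` are assumptions made for illustration; none of them come from the thread.

```python
import numpy as np

# Example points from the original post: frame energy -> desired threshold.
energy = np.array([0.1, 0.2, 0.3])
threshold = np.array([0.5, 0.6, 0.63])

# Assumption: a quadratic is flexible enough. With only three points the
# quadratic passes through all of them exactly.
coeffs = np.polyfit(energy, threshold, deg=2)

def threshold_for(e):
    """Evaluate the fitted polynomial at frame energy e (hypothetical helper)."""
    return np.polyval(coeffs, e)
```

Because three points determine a quadratic exactly, the fit reproduces the hand-picked values but says nothing reliable about other energies. More measured points, ideally at several noise levels, would be needed before trusting the curve.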
Reply by ●June 23, 2006
I've been doing some research on this and found that it is practically impossible to detect speech in noise in a quick and reliable way. The spectral characteristics of unvoiced speech and noise are too similar. In cellular communication, for example, an initial noise estimate is usually made during the first few hundred milliseconds of the conversation, and voice activity detection is then based on roughly two things: a jump in signal energy relative to the initial estimate, and the presence of pitch. The drawback of this kind of algorithm is that a noise burst can fool it into thinking that it is speech. In that case another level of detection is needed, which can be based on signal stationarity, for example.

Hope it helps,
Michael

"doggie" <elusivetruelove2003@yahoo.com> wrote in message news:xpydnSwMG7dZhAfZnZ2dnUVZ_tmdnZ2d@giganews.com...
> Hi all, I am currently working on speech detection in white Gaussian noise. I
> take the FFT of a frame of noisy speech, and if the maximum value is
> above a certain threshold, I declare that speech is detected. Of course,
> another part of the algorithm does a more thorough check.
>
> My question: I need to come up with a formula for this threshold,
> meaning something like threshold = x^2 + y rather than threshold = 0.7. How
> should I do it?
>
> I have an idea, but I don't know if it can work. For example:
> if energy of frame = 0.1, my threshold should be 0.5
> if energy of frame = 0.2, my threshold should be, say, 0.6
> if energy of frame = 0.3, my threshold should be, say, 0.63
>
> Then I plot these points on a Cartesian plot and get the equation of the
> curve; that will be the formula for the threshold. Will this solve my
> problem? The curve will definitely not be linear. If so, how can I find
> the equation of the curve? Any advice?
>
> Thank you
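The energy-jump part of the scheme described above can be sketched in Python as follows. This is only a sketch under stated assumptions: the function name `energy_jump_vad`, the 300 ms initial estimate, the 6 dB margin, and non-overlapping frames are all illustrative choices, and the pitch check and the stationarity check are left out.

```python
import numpy as np

def energy_jump_vad(signal, fs, frame_ms=20, init_ms=300, margin_db=6.0):
    """Flag frames whose energy jumps above an initial noise estimate.

    Assumption: the first init_ms of the signal contains noise only.
    """
    signal = np.asarray(signal, dtype=float)
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)

    # Mean-square energy of each frame.
    energy = np.mean(frames ** 2, axis=1)

    # Noise floor estimated from the first few hundred milliseconds.
    n_init = max(1, int(init_ms / frame_ms))
    noise_energy = np.mean(energy[:n_init])

    # A frame counts as active if it exceeds the floor by margin_db.
    return energy > noise_energy * 10 ** (margin_db / 10)
```

A loud noise burst would pass this test just as speech does, which is the drawback the post points out.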
Reply by ●June 24, 2006
> I've been doing some research on it and found that it is practically
> impossible to detect speech in noise in quick and reliable way. The spectral
> characteristics of unvoiced speech and noise are too similar. In cellular
> communication for example some initial noise estimation is usually done
> during first few hundred milliseconds of the conversation and then the voice
> activity detection is based on roughly 2 things: jump in signal energy
> relative to the initial estimation and presence of pitch. The drawback for
> this kind of algorithm is that a noise burst can fool it to think that it
> is speech. In this case another level of detection is needed that can be
> based on signal stationarity for example.
>
> Hope it helps,
> Michael

Hi Michael, I have the same sentiments too. The zero-crossing rates of unvoiced speech and white noise are too similar, so that method is out.

Right now, I have an algorithm that detects all the speech frames in noise very well, so I do not have the problem of speech being categorized as noise. However, there are some false alarms (noise categorized as speech) which I want to get rid of. I have cut my input signal into frames of 20 ms using a half-overlapping Hamming window.

To get rid of these false alarms, I have tried taking the FFT of the frames categorized as speech. If the maximum of the FFT exceeds a certain threshold, the frame remains classified as speech; otherwise it is reclassified as noise. I got quite good results for white Gaussian noise, but the threshold was a numeric value which I set by observation. So now I am trying to formulate an equation for it, so that the threshold holds for all noise levels (in dB).
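One possible way to make such a threshold follow the noise level, offered as a sketch rather than as the method used in this thread: if the noise is white and Gaussian with variance `noise_var`, the power in each windowed FFT bin away from DC and Nyquist is approximately exponentially distributed with mean `noise_var * sum(window**2)`. Treating the K bins as independent (an approximation, since a Hamming window correlates neighbouring bins), the threshold on the peak bin power for a target false-alarm probability `p_fa` is T = -mean * ln(1 - (1 - p_fa)^(1/K)). The names `peak_power_threshold` and `frame_has_peak`, and the assumption that `noise_var` has been estimated from noise-only frames, are hypothetical.

```python
import numpy as np

def peak_power_threshold(noise_var, window, p_fa=0.01):
    """Threshold on the largest FFT bin power for white Gaussian noise.

    Assumes independent, exponentially distributed bin powers.
    """
    n_bins = len(window) // 2 - 1            # bins strictly between DC and Nyquist
    mean_bin_power = noise_var * np.sum(window ** 2)
    per_bin_tail = 1.0 - (1.0 - p_fa) ** (1.0 / n_bins)
    return -mean_bin_power * np.log(per_bin_tail)

def frame_has_peak(frame, window, noise_var, p_fa=0.01):
    """Return True if the frame's peak bin power exceeds the threshold."""
    spectrum = np.fft.rfft(np.asarray(frame, dtype=float) * window)
    power = np.abs(spectrum[1:-1]) ** 2      # drop the first and last bins
    return bool(power.max() > peak_power_threshold(noise_var, window, p_fa))
```

The threshold is proportional to the noise variance, so it moves with the noise level instead of being a fixed number. For a 20 ms frame at 8 kHz, `window = np.hamming(160)` would match the framing described above.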
Reply by ●June 24, 2006
I'm really no expert, but maybe since it's Gaussian noise, you could use the mean and standard deviation and count the number of times the FFT energy of a certain FFT bin is smaller than mean - std or bigger than mean + std ...
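One reading of this suggestion, sketched in Python: estimate the mean and standard deviation of each bin's power from frames known to contain only noise, then count how many bins of a test frame fall outside mean ± k·std. The function name `count_outlier_bins`, the factor `k`, the Hamming window, and the availability of noise-only frames are assumptions, and the post may have meant something different.

```python
import numpy as np

def count_outlier_bins(frame, noise_frames, k=1.0):
    """Count FFT bins whose power lies outside mean +/- k*std of the noise.

    noise_frames: 2-D array, one noise-only frame per row (assumed available).
    """
    frame = np.asarray(frame, dtype=float)
    noise_frames = np.asarray(noise_frames, dtype=float)
    window = np.hamming(len(frame))

    # Per-bin power statistics estimated from the noise-only frames.
    noise_power = np.abs(np.fft.rfft(noise_frames * window, axis=1)) ** 2
    mean = noise_power.mean(axis=0)
    std = noise_power.std(axis=0)

    power = np.abs(np.fft.rfft(frame * window)) ** 2
    outside = (power < mean - k * std) | (power > mean + k * std)
    return int(np.count_nonzero(outside))
```

For Gaussian noise the bin power is roughly exponentially distributed, so its mean and standard deviation are about equal; with k = 1 the lower bound sits near zero and most of the count comes from the upper bound. A threshold on the count would still have to be chosen.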
Reply by ●June 25, 2006
You might look into using the autocorrelation function of speech. When I did this several years ago, I used a tenth-of-a-second frame. The system is related to a detector for random signals and worked very well on voiced speech. Unfortunately, the SNR was so low that unvoiced speech was masked.

In article <xpydnSwMG7dZhAfZnZ2dnUVZ_tmdnZ2d@giganews.com>, "doggie" <elusivetruelove2003@yahoo.com> wrote:
> Hi all, I am currently working on speech detection in white Gaussian noise. I
> take the FFT of a frame of noisy speech, and if the maximum value is
> above a certain threshold, I declare that speech is detected. Of course,
> another part of the algorithm does a more thorough check.
>
> My question: I need to come up with a formula for this threshold,
> meaning something like threshold = x^2 + y rather than threshold = 0.7. How
> should I do it?
>
> I have an idea, but I don't know if it can work. For example:
> if energy of frame = 0.1, my threshold should be 0.5
> if energy of frame = 0.2, my threshold should be, say, 0.6
> if energy of frame = 0.3, my threshold should be, say, 0.63
>
> Then I plot these points on a Cartesian plot and get the equation of the
> curve; that will be the formula for the threshold. Will this solve my
> problem? The curve will definitely not be linear. If so, how can I find
> the equation of the curve? Any advice?
>
> Thank you
Reply by ●June 26, 2006
> I'm really no expert, but maybe since it's gaussian noise, you could use
> the mean and standard deviation and count the number of times the fft
> energy of a certain fft bin is smaller than mean - std or bigger than mean
> + std ...

Hmm, sounds good. Thanks. I will try it out and see. I eventually hope to generalise to other types of noise too, because Gaussian noise is not very applicable in the real world.
Reply by ●June 26, 2006
> You might look into using the autocorrelation function of speech. When I did
> this several years ago, I used a tenth second frame. The system is related to
> a detector for random signals and worked very well on voiced speech.
> Unfortunately, the SNR was so low that unvoiced speech was masked.

Hi, I will try it. I guess I will have to set a threshold for the autocorrelation too. By the way, how do you compute the autocorrelation, and what information do I get from it? Do I use the amplitude of the peak or the location at which the peak occurs?

Let's say I have y = a 20 ms frame of noisy signal (160 samples). How do I use this to compute the autocorrelation?

Thank you.
Reply by ●June 28, 2006
Essentially, I compute the autocorrelation in the usual way through the frequency domain. If you look at the autocorrelation function of noise, it is concentrated very near the zero lag. Voiced speech has correlation components that are near zero lag but also significant values away from zero. I was also rolling off the data at two kHz, because most of the energy in voiced speech is in the first two formants, which are basically below two kHz. You will need to test this in your system, since this worked well for me because of the noise.

In article <cKadnZ_seuosyQLZnZ2dnUVZ_tydnZ2d@giganews.com>, "doggie" <elusivetruelove2003@yahoo.com> wrote:
>> You might look into using the autocorrelation function of speech. When I did
>> this several years ago, I used a tenth second frame. The system is related to
>> a detector for random signals and worked very well on voiced speech.
>> Unfortunately, the SNR was so low that unvoiced speech was masked.
>
> Hi, I will try it. I guess I will have to set a threshold for the
> autocorrelation too. By the way, how do you compute the autocorrelation, and
> what information do I get from it? Do I use the amplitude of the peak
> or the location at which the peak occurs?
>
> Let's say I have y = a 20 ms frame of noisy signal (160 samples).
> How do I use this to compute the autocorrelation?
>
> Thank you.
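A minimal Python sketch of computing the autocorrelation through the frequency domain, as described above, for a single frame such as the 160-sample example. The function names, the zero-padding length, and the 60 to 400 Hz pitch search range are assumptions made for illustration; the 2 kHz roll-off mentioned in the post is not included.

```python
import numpy as np

def normalized_autocorr(frame):
    """Autocorrelation via FFT, normalized so that lag 0 equals 1."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    n = len(frame)
    # Zero-pad to at least 2n-1 points so the circular correlation
    # computed by the FFT does not wrap around.
    nfft = 1 << (2 * n - 1).bit_length()
    spectrum = np.fft.rfft(frame, nfft)
    acf = np.fft.irfft(np.abs(spectrum) ** 2, nfft)[:n]
    return acf / acf[0] if acf[0] > 0 else acf

def voicing_peak(frame, fs, f_lo=60.0, f_hi=400.0):
    """Largest autocorrelation value away from zero lag, and its lag."""
    acf = normalized_autocorr(frame)
    lag_min = int(fs / f_hi)
    lag_max = min(int(fs / f_lo), len(acf) - 1)
    lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return acf[lag], lag
```

On the amplitude-versus-location question: in this sketch the amplitude of the peak away from zero lag is what separates voiced speech from noise, since noise stays concentrated near zero lag, while the lag of that peak corresponds to the pitch period (`fs / lag` in Hz). A threshold on the amplitude would still have to be tuned for the actual noise conditions.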






