Hello, What are the standard methods for testing the quality of the output from a speech enhancement algorithm? I read that segmental SNR can be used, but that requires that I choose a set of frames where I _know_ that the frame contains speech. Isn't that cheating anyway? I would argue that the variance of the difference between the clean speech signal and the estimated speech signal is a better performance factor as it truly tells you how well your estimated speech signal matches the clean speech signal..... or am I wrong? I am also interested in knowing if there are any programs out there that will process the output (wav-file) from my speech enhancement algorithm and give a score with respect to speech recognition. Is that possible? Thanks in advance :o)
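For reference, segmental SNR is usually computed along the lines of the sketch below (Python/NumPy here, though the thread talks about Matlab; the 256-sample frame, the [-10, 35] dB clamp, and the toy signals are conventional illustrative choices, not anything from the original post). The per-frame clamping is what keeps silent frames from dominating the average, which is one answer to the "isn't frame selection cheating?" question:

```python
import numpy as np

def segmental_snr(clean, estimate, frame_len=256, min_db=-10.0, max_db=35.0):
    """Average of frame-wise SNRs, each clamped to [min_db, max_db] so that
    near-silent frames (tiny signal power) cannot dominate the mean."""
    n_frames = len(clean) // frame_len
    snrs = []
    for i in range(n_frames):
        s = clean[i*frame_len:(i+1)*frame_len]
        e = s - estimate[i*frame_len:(i+1)*frame_len]
        snr_db = 10.0*np.log10((np.sum(s**2) + 1e-12)/(np.sum(e**2) + 1e-12))
        snrs.append(float(np.clip(snr_db, min_db, max_db)))
    return float(np.mean(snrs))

# Toy usage: a clean tone versus a noisy "estimate" of it.
rng = np.random.default_rng(0)
clean = np.sin(2*np.pi*440*np.arange(4096)/8000.0)
estimate = clean + 0.1*rng.standard_normal(4096)
print(f"segmental SNR: {segmental_snr(clean, estimate):.1f} dB")
```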
Evaluation of a speech enhancement algorithm. How?
Started by ●September 28, 2005
Reply by ●September 28, 2005
The SNR is usually not a definitive quality indicator for speech (nor for images, by the way), because the human auditory system's notion of quality differs from mean square error (maximizing SNR is equivalent to minimizing MSE). The "ultimate" quality criterion is something called MOS: mean opinion score. It's based on a series of "blind" listening tests using real humans as the "measuring equipment". But some computer-based criteria are possible, such as perception-weighted SNR (or MSE) and ratios of spectra (which is like comparing distributions). See the book by Deller, Hansen, and ??? (forgot).
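The maximize-SNR / minimize-MSE equivalence mentioned above is easy to check numerically (a throwaway sketch, not from the original post): for a fixed reference signal, ranking candidate estimates by decreasing SNR gives exactly the same ordering as ranking them by increasing MSE.

```python
import numpy as np

# For a fixed reference s, SNR = 10*log10(sum(s^2)/sum((s-x)^2)); since the
# numerator is constant, SNR is a monotone decreasing function of the MSE,
# so the two criteria rank candidate estimates identically (in reverse).
rng = np.random.default_rng(1)
s = rng.standard_normal(1000)
candidates = [s + sigma*rng.standard_normal(1000) for sigma in (0.05, 0.2, 0.8)]
mse = [float(np.mean((s - x)**2)) for x in candidates]
snr = [float(10*np.log10(np.sum(s**2)/np.sum((s - x)**2))) for x in candidates]
print("by MSE:", sorted(range(3), key=lambda i: mse[i]))
print("by SNR:", sorted(range(3), key=lambda i: -snr[i]))
```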
Reply by ●September 28, 2005
Hi, You might be able to compare your algorithm with the articulation index (AI). The result is a scalar value between 0 and 1 (0 = speech not understandable at all, 1 = 100% speech recognition). In summary, these models try to predict human speech recognition scores: the AI is a psychoacoustic model of how intelligible speech is, given the long-term speech/noise spectra and the system's response. As I said, though, these measures require the long-term average of your speech (which you might not be interested in using). There is a free Matlab file that calculates the AI, with additional information about it, at: http://server1.cdsp.neu.edu/info/students/hmuesch/ They use nonsense syllables for the estimation. If your signal is real speech, you will need to apply a context transformation at the end to compensate for its lower entropy.
Reply by ●September 28, 2005
Thanks. That's exactly what I have been looking for. However, for a newbie like me it would be perfect to have a Matlab function that just takes the estimated clean speech as input and calculates some performance index from it... Is that possible?
Reply by ●September 28, 2005
The Mean Opinion Score (MOS) is the "right" way to check speech quality, but there are objective measures that correlate fairly well with the MOS. For example: http://www.antd.nist.gov/wctg/manet/speechq.pdf My question is: how do you have "clean" speech to compare to? Are you taking clean speech, corrupting it electrically with some "noise", and then "enhancing" it? Are you simultaneously recording with a different mic close to the mouth? A lot depends on the source signal(s) and how the noise is being added. BTW, MOS measures speech quality, not intelligibility. The best way to test whether your "enhancement" program affects speech recognition scores is to run its output through a speech recognizer. That will give you a nice "objective" score every time, at least for that recognizer and that kind of input. -- Chip Wood
Reply by ●September 28, 2005
Hi, Not exactly. Here are some things you have to consider: 1) You need access to the original speech and noise as separate signals if you wish to compare the original AI with a second AI computed from your predicted speech; otherwise it won't work. Your noise input for the second AI is the prediction error e = speech - predicted. Your speech input to the second AI would still be the original speech (if your prediction is good, your error should be smaller than the original noise). 2) You have to calculate the spectra of your speech signals and noise, average them over the 20 discrete bands given in the AI help, and feed them into the AI algorithm. You can assume the system's orthotelephonic response is 0.
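The band bookkeeping in point 2 can be sketched roughly as follows. To be clear about assumptions: this is not Muesch's AI code from the link above; the equal-width bands and equal weights are placeholders (the real AI uses tabulated critical-band edges and band-importance weights), and the (SNR + 12)/30 mapping follows the classic French-Steinberg formulation. Treat it only as a shape for the computation:

```python
import numpy as np

def band_snr_index(speech, noise, n_bands=20, nfft=512):
    """Crude articulation-index-style score: long-term band SNRs mapped to
    [0, 1] and averaged.  Equal-width bands and equal weights are
    placeholders; a real AI uses tabulated importance-weighted bands."""
    def long_term_psd(x):
        # Welch-style average of windowed frame periodograms.
        frames = x[:len(x)//nfft*nfft].reshape(-1, nfft)
        return np.mean(np.abs(np.fft.rfft(frames*np.hanning(nfft), axis=1))**2, axis=0)
    ps, pn = long_term_psd(speech), long_term_psd(noise)
    edges = np.linspace(0, len(ps), n_bands + 1, dtype=int)
    ai = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        snr_db = 10*np.log10(np.sum(ps[lo:hi])/(np.sum(pn[lo:hi]) + 1e-12) + 1e-12)
        ai += np.clip((snr_db + 12)/30, 0, 1)/n_bands  # French-Steinberg style mapping
    return float(ai)
```

Used the way point 1 describes, `speech` is the original clean speech and `noise` is the prediction error e = speech - predicted.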
Reply by ●September 28, 2005
What do you mean by 'enhancement'? Are you including intelligibility in what you are describing as quality? What is the character and source of the speech, and what is the source of the problem that requires enhancement? More to follow depending on your answers. Dirk
Reply by ●September 28, 2005
Hi, Thanks for the link :o) Maybe I should have started with a brief description of my algorithm. First of all, I have 2 wav files. One file is the clean speech signal (s[k]) and the other is pink noise (n[k]). These two are added together after the pink noise has been scaled according to a pre-defined SNR. The noisy speech signal x[k] is fed to my algorithm 16 samples at a time. The 16 samples are put into a 256-sample frame. From the contents of the frame I calculate the power spectrum X, subtract the average noise power spectrum N, and obtain the power spectrum P. I found this average noise power spectrum N by averaging lots of power spectra that I know contain only noise. I then convert P to a model-based spectrum (LPC analysis). Next I do some filtering and convert the manipulated power spectrum Pm back to LPC coefficients. That's the frame processing. The block processing takes the 16 samples I mentioned at the beginning and sends them through a whitening filter. The whitened signal is sent through a filter that models the vocal tract, whose coefficients come from the LPC analysis performed on the manipulated power spectrum Pm. The output of the vocal tract filter is the estimated speech. I hope this clarifies things a little bit :o) Where can I download a speech recognizer that will accept a wav file (the estimated speech) as input? Thanks in advance.
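The spectral-subtraction step of the algorithm described above can be sketched roughly like this (Python/NumPy; non-overlapping rectangular frames and a fixed spectral floor are simplifying assumptions of mine, and the 16-sample hop, LPC modelling, and vocal-tract resynthesis from the post are deliberately left out):

```python
import numpy as np

def spectral_subtract(noisy, noise_psd, frame_len=256, floor=0.01):
    """Per-frame power spectral subtraction (the step before the LPC
    modelling).  Non-overlapping rectangular frames keep the sketch short;
    a real implementation would use windowed overlap-add."""
    out = np.zeros_like(noisy)
    for start in range(0, len(noisy) - frame_len + 1, frame_len):
        X = np.fft.rfft(noisy[start:start+frame_len])
        P = np.abs(X)**2 - noise_psd             # subtract average noise power N
        P = np.maximum(P, floor*np.abs(X)**2)    # spectral floor limits musical noise
        Y = np.sqrt(P)*np.exp(1j*np.angle(X))    # keep the noisy phase
        out[start:start+frame_len] = np.fft.irfft(Y, n=frame_len)
    return out
```

Here `noise_psd` would be the average power spectrum N, estimated exactly as the post describes: the mean of |FFT|^2 over frames known to contain only noise.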
Reply by ●September 28, 2005
So you are trying to make the voice sound as if you had not added the pink noise in the first place? Can you give some idea what your relative speech/noise levels are? Where did the whitening filter come from? When you turn the noise level down and analyze/synthesize the waveforms, do the original (clean) signal and the reconstructed signal look similar on a point-by-point basis? If not, then subtracting waveforms and looking at errors is not the way to evaluate the 'quality'. Do some searching on "LPC analysis in noise". Dirk