DSPRelated.com
Forums

evaluation of speech enhancement algorithm. How?

Started by Lars Hansen September 28, 2005
Hello,

What are the standard methods for testing the quality of the output from a 
speech enhancement algorithm?

I read that segmental SNR can be used, but that requires that I choose a set 
of frames where I _know_ that the frame contains speech. Isn't that cheating 
anyway? I would argue that the variance of the difference between the clean 
speech signal and the estimated speech signal is a better performance factor 
as it truly tells you how well your estimated speech signal matches the 
clean speech signal..... or am I wrong?
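For concreteness, a segmental SNR restricted to speech-active frames could be sketched like this in Python/NumPy (the frame length and the simple energy threshold standing in for "frames I know contain speech" are my own assumptions, not a standard):

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, thresh_db=-40.0):
    """Segmental SNR in dB, averaged over frames judged to contain speech.

    Frames whose clean-signal energy falls below thresh_db relative to
    the loudest frame are treated as silence and skipped -- a crude
    energy-based stand-in for a real voice-activity decision.
    """
    n_frames = len(clean) // frame_len
    energies = [np.sum(clean[i * frame_len:(i + 1) * frame_len] ** 2)
                for i in range(n_frames)]
    floor = max(energies) * 10.0 ** (thresh_db / 10.0)
    snrs = []
    for i in range(n_frames):
        if energies[i] < floor:
            continue  # skip frames without speech
        s = clean[i * frame_len:(i + 1) * frame_len]
        e = enhanced[i * frame_len:(i + 1) * frame_len]
        err = np.sum((s - e) ** 2) + 1e-12  # guard against /0
        snrs.append(10.0 * np.log10(energies[i] / err))
    return float(np.mean(snrs))
```

Averaging per-frame SNRs in dB (rather than pooling all samples) is what distinguishes segmental SNR from the global variant.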

I am also interested in knowing if there are any programs out there that 
will process the output (wav-file) from my speech enhancement algorithm and 
give a score with respect to speech recognition. Is that possible?

Thanks in advance :o)




The SNR is usually not a definitive quality indicator for speech (and
not for images either, by the way), because the human ear's quality
objective is different from mean square error (maximizing SNR is
equivalent to minimizing MSE).
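That equivalence is easy to check numerically: for a fixed reference signal, SNR in dB is a strictly decreasing function of MSE, so the two rank candidate estimates identically. A quick sketch (the signals and noise levels are arbitrary):

```python
import numpy as np

# For a fixed reference, SNR in dB is a strictly decreasing function
# of the mean-square error, so ranking estimates by SNR and ranking
# them by MSE always agree.
rng = np.random.default_rng(1)
s = rng.standard_normal(1000)  # arbitrary "clean" reference

def mse(ref, est):
    return float(np.mean((ref - est) ** 2))

def snr_db(ref, est):
    return float(10 * np.log10(np.mean(ref ** 2) / mse(ref, est)))

# Three estimates with decreasing noise level.
ests = [s + sig * rng.standard_normal(1000) for sig in (0.5, 0.2, 0.05)]
mses = [mse(s, e) for e in ests]
snrs = [snr_db(s, e) for e in ests]
assert mses[0] > mses[1] > mses[2]   # MSE improves...
assert snrs[0] < snrs[1] < snrs[2]   # ...exactly as SNR improves
```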

The "ultimate" quality criterion is something called MOS: mean opinion
score. It's based on a series of "blind" tests using real humans as
"measuring equipment".

But some computer-based criteria are possible, such as
"perception-weighted SNR (or MSE)" and ratio of spectra (which is like
comparing distributions). See the book by Deller, Hansen, and ???
(forgot).

Hi,

You might be able to compare your algorithm with the articulation index
(AI). The result is a scalar value between 0 and 1 (0 = speech not
understandable at all, 1 = 100% speech recognition). In summary, these
models try to predict human speech recognition scores: they are
psychoacoustic models of how intelligible speech is, given the long-term
speech/noise spectra and the system's response. As I said, though, they
require the long-term average of your speech (which you might not be
interested in using). There is a free Matlab file that calculates the AI,
with additional information about it, at:

http://server1.cdsp.neu.edu/info/students/hmuesch/

They use nonsense syllables for the estimation. If your signal is real
speech, you will need to apply a context transformation at the end to
compensate for its lower entropy.
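To show the shape of the idea, here is a heavily simplified AI-style score in Python/NumPy. It uses equal-width bands and equal weights, whereas the real AI uses standardized band edges and band-importance weights, so treat it as a sketch only:

```python
import numpy as np

def crude_ai(speech, noise, n_bands=20):
    """Very rough articulation-index-style score in [0, 1].

    Long-term power spectra are averaged into n_bands equal-width bands;
    per-band SNR is clipped to [0, 30] dB and averaged with equal
    weights.  The real AI uses standardized band edges and
    band-importance weights -- this is only the shape of the idea.
    """
    S = np.abs(np.fft.rfft(speech)) ** 2   # long-term speech spectrum
    N = np.abs(np.fft.rfft(noise)) ** 2    # long-term noise spectrum
    snr_db = [10 * np.log10(np.sum(bs) / (np.sum(bn) + 1e-12) + 1e-12)
              for bs, bn in zip(np.array_split(S, n_bands),
                                np.array_split(N, n_bands))]
    return float(np.mean(np.clip(snr_db, 0.0, 30.0)) / 30.0)
```

The clipping reflects the usual AI assumption that intelligibility saturates once a band's SNR exceeds about 30 dB and contributes nothing below 0 dB.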

Thanks. That's exactly what I have been looking for. However, for a newbie 
like me it would be perfect to have a Matlab function that just takes the 
estimated clean speech as input and calculates some performance index from 
it... Is that possible?





The Mean Opinion Score (MOS) is the "right" way to check
speech quality, but there are objective measures that
correlate fairly well with the MOS. For example:
http://www.antd.nist.gov/wctg/manet/speechq.pdf

My question is how do you have "Clean" speech to compare to?
Are you taking clean speech, corrupting it electrically with
some "Noise" and then "enhancing" it?  Are you
simultaneously recording with a different mic close to the
mouth?  A lot depends on the source signal(s) and how the
noise is being added.

BTW, MOS measures speech quality, not intelligibility.  The
best way to test whether your "Enhancement" program affects
Speech Recognition scores is to run it through a Speech
recognizer.  It will give you a nice "Objective" score every
time.  At least for that Recognizer and that kind of input.

-- 
Chip Wood

Hi,

Not exactly. Here are some things you have to consider:

1) You need access to the original speech and noise as separate signals
if you wish to compare the original AI with a second AI computed from
your predicted speech; otherwise it won't work. The noise input for the
second AI is the prediction error e = speech - predicted. The speech
input to the second AI is still the original speech (if your prediction
is good, the error should be smaller than the original noise).

2) You have to calculate the spectra of your speech signal and noise,
average them over the 20 discrete bands given in the AI help, and feed
them into the AI algorithm. You can assume the system's orthotelephonic
response is 0.
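The two steps might be sketched like this in Python/NumPy (equal-width bands stand in for the 20 bands the AI help file specifies, and the three signals are made-up stand-ins):

```python
import numpy as np

def band_averaged_spectrum(x, n_bands=20):
    """Long-term power spectrum averaged into n_bands equal-width bands
    (a stand-in for the 20 bands the AI help file specifies)."""
    P = np.abs(np.fft.rfft(x)) ** 2
    return np.array([np.mean(b) for b in np.array_split(P, n_bands)])

# Stand-in signals: clean speech, additive noise, and an enhancer output.
rng = np.random.default_rng(0)
speech = np.sin(np.linspace(0, 200 * np.pi, 8000))
noise = 0.5 * rng.standard_normal(8000)
predicted = speech + 0.05 * rng.standard_normal(8000)

# First AI: original speech spectrum vs original noise spectrum.
s_bands = band_averaged_spectrum(speech)
n_bands_orig = band_averaged_spectrum(noise)

# Second AI: same speech, but the "noise" input is the prediction
# error e = speech - predicted, as in point 1 above.
e_bands = band_averaged_spectrum(speech - predicted)

# A working enhancer leaves less power in the error than in the noise.
assert np.sum(e_bands) < np.sum(n_bands_orig)
```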

What do you mean by 'enhancement'? Are you including intelligibility in
what you are describing as quality?

What is the character and source of the speech, and what is the source
of the problem that requires enhancement?

More to follow depending on your answers.

Dirk

Hi

Thanks for the link :o)

Maybe I should have started out with a brief description of my algorithm. 
First of all... I have two wav files. One is the clean speech signal (s[k]) 
and the other is pink noise (n[k]). The two files are added together after 
scaling the pink noise according to a predefined SNR. The noisy speech 
signal x[k] is fed to my algorithm 16 samples at a time. The 16 samples are 
put into a 256-sample frame. From the contents of the frame I calculate its 
power spectrum X, subtract the average noise power spectrum N, and obtain 
the power spectrum P. I found this average noise power spectrum N by 
averaging lots of power spectra that I know contain only noise. I then 
convert P to a model-based spectrum (LPC analysis). After some filtering, I 
convert the manipulated power spectrum Pm back to LPC coefficients. That's 
the frame processing.

The block processing takes those 16 samples I mentioned at the beginning 
and sends them through a whitening filter. The whitened signal is then sent 
through a filter that models the vocal tract; this filter gets its 
coefficients from the LPC analysis performed on the manipulated power 
spectrum Pm. The output of the vocal tract filter is the estimated speech.
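The power-spectral-subtraction step in that frame processing might look like this in Python/NumPy (the Hann window, the zero floor, and the noise level are my assumptions; the post does not specify them):

```python
import numpy as np

def spectral_subtract(frame, noise_psd):
    """One frame of power-spectral subtraction: P = |X|^2 - N, floored
    at zero (a common guard against negative power estimates)."""
    X = np.fft.rfft(frame * np.hanning(len(frame)))
    return np.maximum(np.abs(X) ** 2 - noise_psd, 0.0)

rng = np.random.default_rng(0)
frame_len = 256
w = np.hanning(frame_len)

# Average noise PSD from frames known to contain only noise, as in the
# post (the noise level here is made up).
noise_only = 0.1 * rng.standard_normal((100, frame_len))
noise_psd = np.mean(np.abs(np.fft.rfft(noise_only * w, axis=1)) ** 2, axis=0)

# A noisy frame: a tone at bin 8 plus the same kind of noise.
t = np.arange(frame_len)
noisy = np.sin(2 * np.pi * 8 * t / frame_len) + 0.1 * rng.standard_normal(frame_len)
P = spectral_subtract(noisy, noise_psd)  # cleaned power spectrum
```

The cleaned spectrum P is what would then feed the LPC analysis.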

I hope this clarifies things a little bit :o)

Where can I download a speech recognizer that will accept a wav-file (the 
estimated speech) as input?

Thanks in advance..


> More to follow depending on your answers.

See my post above :o)
So you are trying to make the voice sound as if you had never added the
pink noise in the first place? Can you give some idea of your relative
speech/noise levels?

Where did the whitening filter come from?

When you turn the noise level down and analyze/synthesize the waveforms,
do the original (clean) and reconstructed signals look similar on a
point-by-point basis? If not, then subtracting waveforms and looking
at errors is not the way to evaluate the 'quality'.
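A quick way to run that point-by-point check is a normalized correlation. The toy example below (my own signals) also shows why waveform subtraction can fail as a quality measure: a phase-shifted tone sounds identical but correlates near zero.

```python
import numpy as np

def waveform_similarity(clean, reconstructed):
    """Normalized correlation of two waveforms (1.0 = identical up to
    gain and offset, ~0 = no point-by-point resemblance)."""
    c = clean - np.mean(clean)
    r = reconstructed - np.mean(reconstructed)
    return float(np.dot(c, r) / (np.linalg.norm(c) * np.linalg.norm(r)))

t = np.arange(1000) / 1000.0
clean = np.sin(2 * np.pi * 50 * t)
shifted = np.sin(2 * np.pi * 50 * t + np.pi / 2)  # same tone, 90 deg shift

# The shifted tone sounds the same, yet the point-by-point correlation
# (and hence any waveform-difference metric) collapses.
assert waveform_similarity(clean, clean) > 0.999
assert abs(waveform_similarity(clean, shifted)) < 0.01
```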

Do some searching on "LPC analysis in noise".  

Dirk