Vector Quantization for Speech Compression

Started by Carlos Moreno January 9, 2005
Hi,

As my research project, I'm going to study a variation of VQ
when applied to speech signals.

As a starting point, I'm trying to implement what I understand
as the "standard" vector quantization, so that I can later
implement the extensions I want to study, and compare the
results.

However, I'm having a hard time getting results of a minimally
acceptable quality (I could post audio samples if you don't
believe me that the quality of the transcoded signal is truly
horrible).  So, I'd like to request some help -- pointers, info
on parameters typically used for this sort of application,
feedback on what I may be doing wrong, etc.


Here's a rough description of what I'm doing:

I'm starting with a high-quality audio signal (44.1kHz, 16bit).
I set up overlapping frames of 1024 samples (approx 23msec),
with 50% overlap (i.e., one frame of 1024 samples every 512).
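
In code, the framing step looks roughly like this (a Python/NumPy
sketch; the function name is just mine):

    import numpy as np

    def split_into_frames(x, frame_len=1024, hop=512):
        # 50% overlap: a new 1024-sample frame every 512 samples.
        n_frames = 1 + (len(x) - frame_len) // hop
        return np.stack([x[i*hop : i*hop + frame_len]
                         for i in range(n_frames)])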

For each frame, I intend to obtain one index to represent it.

To obtain the index, I do the following:

- Take the 1024 samples, and weight them with a Hamming window.

- Take the FFT, discard the second half, and take magnitude of
   the complex values  (IOW, take the "amplitude spectrum" of
   the windowed frame)

- Now, take this 512-dimensional vector, and determine which
   point of the set is closest to it (regular Euclidean
   distance in the 512-dimensional space).

- The result of the process is the index corresponding to that
   closest point determined in the previous step.
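
Putting those steps together, the per-frame encoding is roughly
the following (a minimal Python/NumPy sketch; `codebook` is my
name for the set of points being searched, assumed here to be
an (N, 512) array of amplitude spectra):

    import numpy as np

    def encode_frame(frame, codebook):
        # Weight the 1024 samples with a Hamming window.
        windowed = frame * np.hamming(len(frame))
        # Amplitude spectrum: rfft of 1024 samples gives 513
        # bins; dropping the last one keeps 512 of them.
        spectrum = np.abs(np.fft.rfft(windowed))[:-1]
        # Index of the closest point, plain Euclidean distance.
        distances = np.linalg.norm(codebook - spectrum, axis=1)
        return int(np.argmin(distances))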

Notice that the search operates entirely on amplitude spectra
(both the input vector and the set of points being searched),
but I keep track of the correspondence between each index and
the *actual frame* (the actual time-domain signal, prior to any
processing, even the Hamming windowing).

In the "training process" (i.e., clustering all the training
samples into the "optimal" set of points to represent the
frames being compressed later on), when I cluster a set of
points and determine the centroid, I also compute the average
of the correspondent frames in the time-domain;  each index
will be associated to this "average in the time-domain".
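
A sketch of one clustering pass with that time-domain
bookkeeping (a k-means-style update in Python/NumPy; the array
shapes and names are my own assumptions):

    import numpy as np
    from scipy.spatial.distance import cdist

    def update_codebooks(spectra, frames, codebook):
        # spectra: (n, 512) amplitude spectra used for clustering.
        # frames:  (n, 1024) raw time-domain frames, kept in step
        # with the spectra.  Assign each training spectrum to its
        # nearest codeword.
        labels = cdist(spectra, codebook).argmin(axis=1)
        new_codebook = codebook.copy()
        time_codebook = np.zeros((len(codebook), frames.shape[1]))
        for k in range(len(codebook)):
            members = labels == k
            if members.any():
                # The spectral centroid drives the clustering...
                new_codebook[k] = spectra[members].mean(axis=0)
                # ...and the matching raw frames are averaged, so
                # each index also gets a time-domain "average".
                time_codebook[k] = frames[members].mean(axis=0)
        return new_codebook, time_codebook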

When reconstructing the signal from a sequence of indices, I
take the frame corresponding to each index, apply a Hamming
window, and overlap-add them all with the same 50% overlap.
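
In code, the reconstruction would be something like (same
conventions as the sketches above; `time_codebook` holds the
per-index time-domain averages):

    import numpy as np

    def decode(indices, time_codebook, frame_len=1024, hop=512):
        window = np.hamming(frame_len)
        out = np.zeros(hop * (len(indices) - 1) + frame_len)
        for i, idx in enumerate(indices):
            # Window the stored time-domain average for this
            # index and add it in at the 50% frame spacing.
            out[i*hop : i*hop + frame_len] += window * time_codebook[idx]
        return out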


Preliminary tests:

I'm using a segment of approximately 1 minute of my recorded
voice for training, and then I transcode (encode/decode) a
few seconds of a different recording of my voice.

I'm using 10 bits for the indices (1024 possible spectral
points to quantize each frame).  The result is catastrophic.

I can (barely) understand what's being said (then again,
perhaps I understand it because I know what's being said),
but I was expecting an acceptable level of quality and
intelligibility.


I wonder if the fact that I'm using the full wide-band spectrum
spreads the training samples too thin, so that the clusters
end up mixing points that are far too distant?

If I keep the same frames and same sampling rate, but apply
a "C weighting" to the amplitude spectrum, could the results
improve considerably?


I'll be most grateful for any thoughts, comments or feedback
that you may offer -- if you have any information on parameters
for typical implementations of this technique, that would also
be appreciated.

(BTW, I'm more or less confident that the actual implementation
of the quantizer itself is bug-free -- I used it to implement
a "color palette" to compress images, and the results are quite
good.  Also, I did many tests for the FFT, and it passed the
debugging tests.)

Thanks,

Carlos
--
Hi Carlos,

Let me check I am understanding you correctly. You are VQing the entire 
23ms block down to 10 bits? So you have a data rate of 435 bits per 
second? If that is right, you should expect it to sound dreadful. The 
most highly optimised speech compressors at that bit rate are only 
comprehensible to trained ears. Your approach is not optimised at all. 
You aren't grooming the signal before compression, so you are wasting 
bits on data which is not relevant to understandable speech. You are 
also consuming considerable data in massive overlaps, which is wasteful. 
In your scheme you need to overlap, but an optimised scheme would not.

Regards,
Steve


Reply to:  Steve Underwood

Hi Steve,

Your post doesn't show up in my newsreader, but I saw it
through groups.google.com.

Thanks for your comments and feedback.

The effective rate is actually 870 bits/sec (twice the figure
you mention), given that the frames start every ~11.6 msec (one
10-bit index per 512-sample hop).  The fact that I "mask" each
window with a Hamming weighting means that only the central
portion of the window (roughly half of it) provides highly
relevant information.

So, I was expecting that even though I'm using 23msec windows
and an FFT over each 23msec segment, the spectra would nicely
cluster into 2^10 possible values, with each index effectively
conveying only ~11.6msec of new information.

I'll play some more with the sizes and check the resulting
quality.

I was also suspecting that I might need to apply some weighting
to the spectral values, to spend the bits where there's useful
information -- your comments seem to confirm this.
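
Concretely, I imagine something like a weighted Euclidean
distance in the search (a sketch only; choosing the weights is
exactly the open question):

    import numpy as np

    def encode_frame_weighted(spectrum, codebook, weights):
        # weights: per-bin emphasis, e.g. heavier below ~4kHz
        # where most of the speech energy and intelligibility
        # live.  Squared distance is enough for an argmin.
        d2 = ((codebook - spectrum) ** 2 * weights).sum(axis=1)
        return int(np.argmin(d2))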

One thing I'm not sure I got: what exactly do you mean by "an
optimized scheme"?  Are there typical optimizations that are
done as "standard procedure" for this VQ technique?  Or are
you referring in general to optimizations that clever engineers
do when implementing this technique and possibly keep secret?

Part of the extension that I intend to investigate could be
seen as one of these possible optimizations.  Are there other
widely-known tricks that can be combined with this technique?

Once again, your comments are very useful and much appreciated!
Thank you for taking the time to share this information!  (any
further comments will be equally appreciated! :-))

Carlos
--
Carlos Moreno wrote:

> The effective rate is actually 870 bits/sec (twice the figure
> you mention), given that the frames start every ~11.6 msec
> (one 10-bit index per 512-sample hop).  [...]

I forgot to allow for your overlaps.  870 bits/sec is still
extremely low for voice coding.

> One thing I'm not sure I got: what exactly do you mean by "an
> optimized scheme"?  Are there typical optimizations that are
> done as "standard procedure" for this VQ technique?  Or are
> you referring in general to optimizations that clever
> engineers do when implementing this technique and possibly
> keep secret?

It is not optimised mainly in the sense that it is not tailored
to speech, so it is wasting bits coding (relatively) useless
things:

- You aren't trying to code generic sounds; you are trying to
  code speech.  Your coding doesn't take into account the
  limited range of sounds the human voice can generate, and any
  successful low bit rate voice codec will, in some way, do
  this.  If you don't, you are wasting a lot of your precious
  bits.  LPC is the basis for most current commercial voice
  codecs, because it can be made to model the human vocal tract
  quite well.  Vector quantised LPC coeffs provide intelligible,
  but rather robotic, voice at the bit rate you are using.  Far
  more bits are needed for good quality speech.  Commercial
  codecs generally range from about 4kbps to 15kbps to provide
  telephone quality speech.  Most of those bits are used to turn
  the robotic voice into a pleasant and identifiable one.

- You are coding more spectrum than you need.  0-4kHz is
  adequate for telephony speech.  Your VQ training is probably
  focusing the quantiser towards the low end of the spectrum,
  as that is where much of the energy is.  However, using a
  lower sampling rate in the first place would be better.

You are trying for a more aggressive bit rate than any
commercial voice codec uses.  It's even lower than most
military codecs use - they care a lot about intelligibility,
but a lot less about voice quality.  Try looking at what some
existing codecs do.
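
To make the LPC point concrete, this is the sort of analysis I
mean -- the autocorrelation method plus a Levinson-Durbin
recursion, sketched in Python/NumPy (the order and names are
arbitrary; a real codec would quantise these coefficients, or
an LSP transform of them, rather than the raw spectrum):

    import numpy as np

    def lpc_coefficients(frame, order=10):
        frame = frame * np.hamming(len(frame))
        # Autocorrelation up to lag `order`.
        r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                      for k in range(order + 1)])
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0] + 1e-12          # guard against silent frames
        for i in range(1, order + 1):
            # Reflection coefficient for this stage.
            k = -(r[i] + np.dot(a[1:i], r[i-1:0:-1])) / err
            a[1:i+1] = a[1:i+1] + k * a[i-1::-1][:i]
            err *= 1.0 - k * k
        return a    # a[0] = 1; a[1:] are the predictor coeffs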

> Part of the extension that I intend to investigate could be
> seen as one of these possible optimizations.  Are there other
> widely-known tricks that can be combined with this technique?

Voice coding has been one of the most heavily worked on, and
patented, areas of DSP.  This is largely due to the huge market
for cellular radio.  All successful codecs made extensive use
of VQ, but I don't know of anybody applying it to the spectrum
itself.

Regards,
Steve
You betcha, 870 bits/sec is low.  We developed a 300 bits/sec
codec and spent many months optimizing it.  It produced
acceptable intelligibility with acceptable quality, mostly with
low-pitched males.  Most commercial systems are at 2400 bits/sec
and up.

-- 
Chip Wood

"Steve Underwood" <steveu@dis.org> wrote in message
news:cs049t$6fa$1@home.itg.ti.com...
> Carlos Moreno wrote: > > > > > Reply to: Steve Underwood > > > > Hi Steve, > > > > Your post doesn't show through my newsreader, but I saw
it
> > through groups.google.com. > > > > Thanks for your comments and feedback. > > > > The effective rate is actually 870 bits/sec (twice the
figure
> > you mention), given that the frames start every
11msec -- the
> > fact that I "mask" each window with a Hamming weighting
means
> > that only the central portion of the window (roughly
half of
> > it) provides highly relevant information; > > I forgot to allow for your overlaps. 870bps is still
extremely low for
> voice coding. > > > So, I was expecting that even though I'm using 23msec
windows
> > and an FFT for that 23msec segment, the spectra would
nicely
> > cluster into 2^10 possible values, hoping that it would
convey
> > only 11msec of effective information. > > > > I'll play some more with the sizes and check the
resulting
> > quality. > > > > I was also suspecting that I might need to apply some
weighting
> > to the spectral values, to spend the bits where there's
useful
> > information -- your comments seem to confirm this. > > > > One thing I'm not sure I got is what exactly you mean by
"an
> > optimized scheme"? Are there typical optimizations that
are
> > done as "standard procedure" for this VQ technique? Or
are
> > you referring in general to optimizations that clever
engineers
> > do when implementing this technique and that possibly
keep
> > secret? > > It is not optimised mainly in the sense that it is not
tailored to
> speech, so it is wasting bits coding (relatively) useless
things:
> > - You aren't trying to code generic sounds. You are trying
to code
> speech. Your coding doesn't take into account the limited
range of
> sounds the human voice can generate. Any successful low
bit rate voice
> codec will, in some way, do this. If you don't you are
wasting a lot of
> your precious bits. LPC is the basis for most current
commercial voice
> codecs, because it can be made to model the human voice
tract quite
> will. Vector quantised LPC coeffs provide intelligible,
but rather
> robotic, voice at the bit rate you are using. Far more
bits are needed
> for good quality speech. Commercial codecs generally range
from about
> 4kbps to 15kbps to provide telephone quality speech. Most
of those bits
> are used to turn the robotic voice into a pleasant and
identifiable one.
> > - You are coding more spectrum than you need. 0-4kHz is
adequate for
> telephony speech. Your VQ training is probably focusing
the quantiser
> towards the low end of the spectrum, as it is where much
of the energy
> is. However, using a lower sampling rate in the first
place would be better.
> > You are trying for a more aggressive bit rate than any
commercial voice
> codec uses. Its even lower than most military codecs use -
they care a
> lot about intelligibility, but a lot less about voice
quaility. Try
> looking at what some existing codecs do. > > > Part of the extension that I intend to investigate could
be
> > seen as one of these possible optimizations; I don't
know if
> > there are other widely-known tricks that can be combined
with
> > this technique? > > Voice coding has been one of the most heavily worked on,
and patented,
> areas of DSP. This is largely due to the huge market for
cellular radio.
> All successful codecs made extensive use of VQ, but I
don't know of
> anybody applying it to the spectrum itself. > > > Once again, your comments are very useful and much
appreciated!
> > Thank you for taking the time to share this information!
(any
> > further comments will be equally appreciated! :-)) > > Regards, > Steve