Importance of Vector Quantization in Audio Signals Processing

Started by d23 7 years ago · 5 replies · latest reply 7 years ago

I was wondering if anyone knew whether vector quantization was important in audio signal processing?

I know of the importance of vector quantization in digital signal processing when it comes to compression, but I was wondering if there were any practical uses of vector quantization in the audio domain?
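For anyone new to the term, here is a minimal sketch of plain VQ as it is used for compression: each input vector is replaced by the index of its nearest codeword in a codebook shared by encoder and decoder, so only the index needs to be stored or sent. The function names and the toy 2-D codebook are made up for illustration; a real codec would train its codebook on data.

```python
def nearest_codeword(vector, codebook):
    """Return the index of the codeword closest to `vector` (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for i, codeword in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, codeword))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index

def vq_encode(vectors, codebook):
    """Encoder: replace each vector by a codeword index (this is what gets stored/sent)."""
    return [nearest_codeword(v, codebook) for v in vectors]

def vq_decode(indices, codebook):
    """Decoder: look each index up in the same codebook."""
    return [codebook[i] for i in indices]

# Toy 2-D codebook with 4 entries, so each vector costs log2(4) = 2 bits.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
frames = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.95)]
indices = vq_encode(frames, codebook)
print(indices)                       # [0, 3, 2]
print(vq_decode(indices, codebook))  # [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
```

The compression comes from the fact that the decoder only needs the 2-bit indices, at the cost of the quantization error between each frame and its codeword.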

I did find a few papers on the use of vector quantization to classify audio (speech).

1. Audio Classification and Retrieval by Using Vector Quantization

2. Audiovisual Speech Coder

I was wondering if there was something more to vector quantization in audio.

Sorry about the vague question, but it's for a homework assignment, and I need to come up with at least 3 practical applications for vector quantization in audio signal processing. So far I've only been able to come up with 1, after a few hours of scouring the web.

Thanks for the help!


Reply by dudelsound, August 10, 2017

AFAIK, it is used heavily in the open-source Opus codec (a separate codec from Ogg Vorbis, although Vorbis also relies on VQ codebooks), which you can find at xiph.org along with its resources. In the CELT layer of Opus they use something they call PVQ (pyramid vector quantizer)...
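A rough sketch of the PVQ idea, assuming the usual definition: the codebook is every integer vector of length N whose absolute values sum to K (the "pyramid"), and the decoder rescales the chosen vector to unit norm. The greedy pulse-by-pulse search below is an illustrative assumption, not Opus's actual search code or bitstream format; `pvq_quantize` and `pvq_reconstruct` are hypothetical names.

```python
import math

def pvq_quantize(x, k):
    """Map x to an integer vector y with sum(|y_i|) == k (a point on the PVQ pyramid).

    Greedy search (illustrative): a rough projection first, then each remaining
    pulse goes wherever it most increases the normalized correlation (x.y)^2 / (y.y).
    Assumes k >= 1.
    """
    n = len(x)
    ax = [abs(v) for v in x]
    total = sum(ax)
    if total == 0:
        # Degenerate all-zero input: put every pulse on the first coefficient.
        return [k] + [0] * (n - 1)
    # Initial projection; floor() guarantees we never exceed k pulses.
    y = [int(math.floor(k * a / total)) for a in ax]
    xy = sum(a * p for a, p in zip(ax, y))
    yy = sum(p * p for p in y)
    for _ in range(k - sum(y)):
        best_i, best_score = 0, -1.0
        for i in range(n):
            new_xy = xy + ax[i]
            new_yy = yy + 2 * y[i] + 1  # (y_i + 1)^2 - y_i^2 = 2*y_i + 1
            score = new_xy * new_xy / new_yy
            if score > best_score:
                best_i, best_score = i, score
        xy += ax[best_i]
        yy += 2 * y[best_i] + 1
        y[best_i] += 1
    # Put the signs of the input back.
    return [p if v >= 0 else -p for p, v in zip(y, x)]

def pvq_reconstruct(y):
    """Decoder side: the unit-norm shape vector y / ||y||."""
    norm = math.sqrt(sum(p * p for p in y))
    return [p / norm for p in y]

y = pvq_quantize([0.8, -0.5, 0.3], 4)
print(y, pvq_reconstruct(y))
```

Only the shape (direction) of the vector is quantized here; in CELT the band energy is coded separately, which is why the reconstruction is normalized.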

Reply by d23, August 10, 2017

Big thanks for the heads up.

From the looks of things (Mozilla - Pyramid Vector Quantization), PVQ seems to be a more efficient way of performing vector quantization.
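One reason structured schemes like PVQ count as efficient is that the codebook never has to be stored: its size, and hence the bits needed per index, can be computed from N and K alone. A minimal sketch using the standard counting recurrence V(N, K) = V(N-1, K) + V(N, K-1) + V(N-1, K-1); `pvq_size` and `bits_needed` are illustrative names.

```python
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def pvq_size(n, k):
    """Number of integer vectors of length n whose absolute values sum to k."""
    if k == 0:
        return 1  # only the all-zero vector
    if n == 0:
        return 0  # no coordinates left to hold the remaining pulses
    return pvq_size(n - 1, k) + pvq_size(n, k - 1) + pvq_size(n - 1, k - 1)

def bits_needed(n, k):
    """Bits for a fixed-length index into the (never stored) PVQ codebook."""
    return math.ceil(math.log2(pvq_size(n, k)))

print(pvq_size(2, 2))     # 8: (±2,0), (0,±2), (±1,±1)
print(pvq_size(3, 2))     # 18
print(bits_needed(3, 2))  # 5
```

An unstructured VQ codebook with the same number of entries would have to be trained and stored in full, and searched exhaustively.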

Looks like I'll need to perform a bit more research on my end, as PVQ seems more suited for image compression than audio.

Either way, thanks for the help, Dude!

Reply by khuasw, August 10, 2017

In the early days of speech recognition, vector quantization (VQ) was a commonly used technique for reducing the feature space to a more tractable size.

First, an unsupervised classification (clustering, e.g. k-means) is run on a large amount of framed speech data. This gives a few hundred or more classes of speech frames, i.e. the VQ codebook. Given some labeled speech, the probability of each class being associated with a certain phoneme can then be calculated. Finally, those probabilities are fed into a probabilistic model, typically a hidden Markov model, to decode the speech.
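The pipeline above can be sketched as follows, under simplifying assumptions: plain k-means stands in for whatever codebook training a given system used (LBG was a common choice), frames are already feature vectors, and each frame carries a phoneme label. The output is a table of P(codeword | phoneme), the kind of discrete emission table a discrete HMM uses. All names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_codebook(frames, k, iterations=20):
    """Plain k-means (Lloyd's algorithm) on feature frames; returns k codewords.
    Initialised with the first k frames to keep the sketch deterministic."""
    codebook = [list(f) for f in frames[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for f in frames:
            j = min(range(k), key=lambda i: sq_dist(f, codebook[i]))
            clusters[j].append(f)
        for i, members in enumerate(clusters):
            if members:  # keep the old codeword if its cluster is empty
                codebook[i] = [sum(col) / len(members) for col in zip(*members)]
    return codebook

def quantize(frames, codebook):
    """Turn each frame into a discrete symbol: the index of its nearest codeword."""
    return [min(range(len(codebook)), key=lambda i: sq_dist(f, codebook[i])) for f in frames]

def emission_table(symbols, phoneme_labels, k):
    """Estimate P(symbol | phoneme) from frame-level labels, with add-one smoothing."""
    counts = defaultdict(Counter)
    for s, p in zip(symbols, phoneme_labels):
        counts[p][s] += 1
    table = {}
    for p, c in counts.items():
        total = sum(c.values()) + k
        table[p] = [(c[s] + 1) / total for s in range(k)]
    return table

# Toy 2-D "features" with two obvious clusters and made-up phoneme labels.
frames = [(0, 0), (5, 5), (0.2, 0.1), (5.1, 4.9), (0.1, 0.3), (4.8, 5.2)]
labels = ["a", "e", "a", "e", "a", "e"]
codebook = train_codebook(frames, 2)
symbols = quantize(frames, codebook)
print(symbols)  # [0, 1, 0, 1, 0, 1]
print(emission_table(symbols, labels, 2))
```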

In the '90s most people switched to a soft classifier, namely the Gaussian mixture model (GMM). People also tried a great many other distributions, and even neural networks as non-linear distribution fitters.
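To illustrate the hard vs. soft distinction on a toy 1-D feature: VQ picks a single winning class, while a GMM gives every component a posterior probability. The weights, means, and variances below are invented, and real systems use multi-dimensional (often diagonal-covariance) Gaussians.

```python
import math

def gaussian_pdf(x, mean, var):
    """1-D normal density."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def hard_assign(x, means):
    """VQ-style: a single winning class index (nearest mean)."""
    return min(range(len(means)), key=lambda i: (x - means[i]) ** 2)

def soft_assign(x, weights, means, variances):
    """GMM-style: posterior probability of each component given x."""
    scores = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(scores)
    return [s / total for s in scores]

means = [0.0, 2.0]
print(hard_assign(0.9, means))                          # 0
print(soft_assign(0.9, [0.5, 0.5], means, [1.0, 1.0]))  # roughly [0.55, 0.45]
```

A frame near the boundary between two classes keeps information about both under the soft model, whereas hard VQ throws that away.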

Btw, I've also come across a paper in which they did VQ on a very large, high-quality music database. A fun thing they then did was to regenerate high-frequency content from band-limited audio by frame matching and concatenation.
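I don't know the details of that paper, so this is only a hypothetical sketch of the general idea: keep a database (codebook) of paired low-band and high-band spectral frames taken from full-band music, find the entry whose low band best matches each band-limited input frame, and concatenate its high band onto the input. A real system would also need gain matching and smooth frame-to-frame transitions (e.g. overlap-add), which this sketch ignores; `extend_bandwidth` and the toy vectors are made up.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def extend_bandwidth(lowband_frames, database):
    """For each band-limited frame, borrow the high band of the best-matching entry.

    database: list of (low_band_vector, high_band_vector) pairs, assumed to be
    spectral-magnitude frames built from full-band recordings.
    """
    output = []
    for low in lowband_frames:
        # Nearest-neighbour match on the low band only (the part we actually have).
        _, high = min(database, key=lambda entry: sq_dist(low, entry[0]))
        output.append(list(low) + list(high))  # low band + borrowed high band
    return output

database = [([1.0, 0.0], [0.5]), ([0.0, 1.0], [0.1])]
print(extend_bandwidth([[0.9, 0.1], [0.1, 0.8]], database))
# [[0.9, 0.1, 0.5], [0.1, 0.8, 0.1]]
```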

Useful reference: 

  • Fink, Gernot A. Markov models for pattern recognition: from theory to applications. Springer Science & Business Media, 2014.

Reply by d23, August 10, 2017

A big thank you for the information.

Can I assume that the switch to the Gaussian mixture model was due to better speech recognition results compared with vector quantization?

Can I also assume that a Gaussian mixture model is more resource-intensive (e.g. uses more CPU, RAM, etc.) than vector quantization?

Thanks, Khuasw!

Reply by khuasw, August 10, 2017

I think your guesses are correct. The mathematical construction of GMMs allows people to apply fancy training criteria, especially those that make statistical sense, e.g. maximum likelihood, maximum a posteriori, or minimum phone error. That also brings in a lot of computational complexity, though. By contrast, plain VQ requires just a discrete HMM, trainable with the textbook version of the EM algorithm (Baum-Welch), which is about an order of magnitude faster than an HMM-GMM.
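To make the cost argument concrete: with a VQ front end each frame becomes a codeword index, so the emission probability in a discrete HMM is a plain table lookup. Below is a minimal sketch of the forward algorithm (the likelihood computation that Baum-Welch/EM training is built on) with made-up toy parameters. An HMM-GMM would replace `emission[s][sym]` with a Gaussian mixture density evaluated for every state and every frame, which is where much of the extra computation comes from.

```python
def forward_likelihood(symbols, initial, transition, emission):
    """P(symbol sequence | discrete HMM) via the forward algorithm.

    symbols:    list of VQ codeword indices, one per frame
    initial:    initial[s] = P(first state = s)
    transition: transition[r][s] = P(next state = s | current state = r)
    emission:   emission[s][sym] = P(symbol | state), a plain lookup table
    """
    n_states = len(initial)
    alpha = [initial[s] * emission[s][symbols[0]] for s in range(n_states)]
    for sym in symbols[1:]:
        alpha = [
            sum(alpha[r] * transition[r][s] for r in range(n_states)) * emission[s][sym]
            for s in range(n_states)
        ]
    return sum(alpha)

# Toy left-to-right model: 2 states, 2 codeword symbols.
initial = [1.0, 0.0]
transition = [[0.5, 0.5], [0.0, 1.0]]
emission = [[0.9, 0.1], [0.2, 0.8]]
print(forward_likelihood([0, 1], initial, transition, emission))  # about 0.405
```

For real utterances the raw probabilities underflow quickly, so practical implementations scale alpha per frame or work in the log domain.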

By the 2000s computers were fast enough that none of this presented a problem, unless you're doing the so-called deep learning stuff.

I'm not an expert in speech recognition, so it might be better to dive into the literature to double-check.