Hi,

As my research project, I'm going to study a variation of VQ when applied to speech signals.

As a starting point, I'm trying to implement what I understand as the "standard" vector quantization, so that I can later implement the extensions I want to study and compare the results.

However, I'm having a hard time getting results of a minimally acceptable quality (I could post audio samples if you don't believe me that the quality of the transcoded signal is truly horrible). So, I'd like to request some help -- pointers, info on parameters typically used for this sort of application, feedback on what I may be doing wrong, etc.

Here's a rough description of what I'm doing:

I'm starting with a high-quality audio signal (44.1 kHz, 16-bit). I set up overlapping frames of 1024 samples (approx 23 msec), with 50% overlap (i.e., one frame of 1024 samples every 512).

For each frame, I intend to obtain one index to represent it. To obtain the index, I do the following:

- Take the 1024 samples and weight them with a Hamming window.

- Take the FFT, discard the second half, and take the magnitude of the complex values (IOW, take the "amplitude spectrum" of the windowed frame).

- Now, take this 256-dimensional vector and determine which codebook point it is closest to (regular Euclidean distance in the 256-dimensional space).

- The result of the process is the index corresponding to the closest point determined in the previous step.
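[Editor's note: the frame-to-index steps above can be sketched as follows. This is an illustrative toy, not Carlos's actual code; the function name, NumPy usage, and codebook layout are my own assumptions.]

```python
import numpy as np

def encode_frames(signal, codebook, frame_len=1024, hop=512):
    """Quantize overlapping frames to codebook indices.

    codebook: (K, frame_len // 2) array of amplitude spectra.
    Note: a 1024-point FFT yields 512 useful bins in its first
    half; the post says 256 dimensions, so adjust frame_len (or
    any downsampling of the spectrum) to match your setup.
    """
    window = np.hamming(frame_len)
    indices = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        # Hamming-weight the frame.
        frame = signal[start:start + frame_len] * window
        # Magnitude of the first half of the FFT ("amplitude spectrum").
        spectrum = np.abs(np.fft.rfft(frame))[:frame_len // 2]
        # Nearest codebook entry by Euclidean distance.
        dists = np.linalg.norm(codebook - spectrum, axis=1)
        indices.append(int(np.argmin(dists)))
    return indices
```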
Notice that the search process and the set of points to search are all amplitude spectra, but I keep track of the correspondence between each index and the *actual frame* (the actual signal, prior to any processing, even the Hamming windowing).

In the "training process" (i.e., clustering all the training samples into the "optimal" set of points to represent the frames being compressed later on), when I cluster a set of points and determine the centroid, I also compute the average of the corresponding frames in the time domain; each index will be associated with this "average in the time domain".

When reconstructing the signal from a sequence of indices, I take the frames corresponding to each index, apply a Hamming window, and add them all, with the same overlap.

Preliminary tests:

I'm using a segment of approximately 1 minute of my recorded voice for training, and then I transcode (encode/decode) a few seconds of a different recording of my voice.

I'm using 10 bits for the indices (1024 possible spectral points to quantize each frame). The result is catastrophic. I can (barely) understand what's being said (then again, perhaps I understand it because I know what's being said), but I was expecting an acceptable level of quality and intelligibility.

I wonder if the fact that I'm using the full wide spectrum makes the training samples too spread out, so that the clusters end up mixing points that are way too far apart?

If I keep the same frames and same sampling rate, but apply a "C weighting" to the amplitude spectrum, could the results improve considerably?

I'll be most grateful for any thoughts, comments or feedback that you may offer -- if you have any information on parameters for typical implementations of this technique, that would also be appreciated.

(BTW, I'm more or less confident that the actual implementation of the quantizer itself is bug-free -- I used it to implement a "color palette" to compress images, and the results are quite good.
Also, I did many tests for the FFT, and it passed the debugging tests.)

Thanks,

Carlos
--
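[Editor's note: the training and overlap-add reconstruction described above can be sketched roughly as follows. This is an illustrative toy assuming NumPy; the naive k-means variant, function names, and parameters are my own assumptions, not Carlos's actual code.]

```python
import numpy as np

def train_codebook(frames, spectra, k, iters=20, seed=0):
    """Toy k-means on amplitude spectra. Each cluster also keeps the
    time-domain average of its member frames, as the post describes.
    Returns (spectral_codebook, time_domain_codebook)."""
    rng = np.random.default_rng(seed)
    centroids = spectra[rng.choice(len(spectra), k, replace=False)]
    for _ in range(iters):
        # Assign each training spectrum to its nearest centroid.
        d = np.linalg.norm(spectra[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = spectra[labels == j].mean(axis=0)
    # Per-cluster average of the raw (unwindowed) time-domain frames.
    time_cb = np.stack([frames[labels == j].mean(axis=0)
                        if np.any(labels == j) else np.zeros(frames.shape[1])
                        for j in range(k)])
    return centroids, time_cb

def decode(indices, time_cb, frame_len=1024, hop=512):
    """Overlap-add the Hamming-windowed time-domain codebook frames."""
    window = np.hamming(frame_len)
    out = np.zeros(hop * (len(indices) - 1) + frame_len)
    for n, idx in enumerate(indices):
        out[n * hop:n * hop + frame_len] += time_cb[idx] * window
    return out
```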
Vector Quantization for Speech Compression
Started by ●January 9, 2005
Reply by ●January 9, 2005
Hi Carlos,

Let me check I am understanding you correctly. You are VQing the entire 23 ms block down to 10 bits? So you have a data rate of 435 bits per second?

If that is right, you should expect it to sound dreadful. The most highly optimised speech compressors at that bit rate are only comprehensible to trained ears. Your approach is not optimised at all. You aren't grooming the signal before compression, so you are wasting bits on data which is not relevant to understandable speech. You are also consuming considerable data in massive overlaps, which is wasteful. In your scheme you need to overlap, but an optimised scheme would not.

Regards,
Steve

Carlos Moreno wrote:
> [full original post quoted; snipped]
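[Editor's note: the two rates discussed in this thread follow directly from the frame parameters. A quick check, with variable names of my own choosing:]

```python
# Bit-rate arithmetic for 10-bit indices at 44.1 kHz.
fs = 44100
frame_len, hop, bits = 1024, 512, 10

# One index per non-overlapping 1024-sample block (Steve's ~435 bps figure):
rate_no_overlap = bits * fs / frame_len    # ~430.7 bps

# One index per 512-sample hop, i.e. 50% overlap (Carlos's ~870 bps figure):
rate_with_overlap = bits * fs / hop        # ~861.3 bps
```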
Reply by ●January 10, 2005
Reply to: Steve Underwood

Hi Steve,

Your post doesn't show up through my newsreader, but I saw it through groups.google.com. Thanks for your comments and feedback.

The effective rate is actually 870 bits/sec (twice the figure you mention), given that the frames start every 11 msec -- the fact that I "mask" each window with a Hamming weighting means that only the central portion of the window (roughly half of it) provides highly relevant information. So, I was expecting that even though I'm using 23 msec windows and an FFT for that 23 msec segment, the spectra would nicely cluster into 2^10 possible values, hoping that each index would convey only 11 msec of effective information.

I'll play some more with the sizes and check the resulting quality.

I was also suspecting that I might need to apply some weighting to the spectral values, to spend the bits where there's useful information -- your comments seem to confirm this.

One thing I'm not sure I got is what exactly you mean by "an optimized scheme". Are there typical optimizations that are done as "standard procedure" for this VQ technique? Or are you referring in general to optimizations that clever engineers do when implementing this technique and possibly keep secret? Part of the extension that I intend to investigate could be seen as one of these possible optimizations; I don't know if there are other widely-known tricks that can be combined with this technique?

Once again, your comments are very useful and much appreciated! Thank you for taking the time to share this information! (Any further comments will be equally appreciated! :-))

Carlos
--
Reply by ●January 11, 2005
Carlos Moreno wrote:
> The effective rate is actually 870 bits/sec (twice the figure
> you mention), given that the frames start every 11 msec -- the
> fact that I "mask" each window with a Hamming weighting means
> that only the central portion of the window (roughly half of
> it) provides highly relevant information;

I forgot to allow for your overlaps. 870 bps is still extremely low for voice coding.

> One thing I'm not sure I got is what exactly you mean by "an
> optimized scheme"? Are there typical optimizations that are
> done as "standard procedure" for this VQ technique? Or are
> you referring in general to optimizations that clever engineers
> do when implementing this technique and possibly keep secret?

It is not optimised mainly in the sense that it is not tailored to speech, so it is wasting bits coding (relatively) useless things:

- You aren't trying to code generic sounds. You are trying to code speech. Your coding doesn't take into account the limited range of sounds the human voice can generate. Any successful low bit rate voice codec will, in some way, do this. If you don't, you are wasting a lot of your precious bits. LPC is the basis for most current commercial voice codecs, because it can be made to model the human vocal tract quite well. Vector quantised LPC coeffs provide intelligible, but rather robotic, voice at the bit rate you are using. Far more bits are needed for good quality speech. Commercial codecs generally range from about 4 kbps to 15 kbps to provide telephone quality speech. Most of those bits are used to turn the robotic voice into a pleasant and identifiable one.

- You are coding more spectrum than you need. 0-4 kHz is adequate for telephony speech. Your VQ training is probably focusing the quantiser towards the low end of the spectrum, as that is where much of the energy is. However, using a lower sampling rate in the first place would be better.

You are trying for a more aggressive bit rate than any commercial voice codec uses. It's even lower than most military codecs use - they care a lot about intelligibility, but a lot less about voice quality. Try looking at what some existing codecs do.

> Part of the extension that I intend to investigate could be
> seen as one of these possible optimizations; I don't know if
> there are other widely-known tricks that can be combined with
> this technique?

Voice coding has been one of the most heavily worked on, and patented, areas of DSP. This is largely due to the huge market for cellular radio. All successful codecs make extensive use of VQ, but I don't know of anybody applying it to the spectrum itself.

Regards,
Steve
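[Editor's note: the LPC coefficients Steve mentions are commonly obtained via the autocorrelation method and the Levinson-Durbin recursion. A rough sketch for illustration only -- the function name and details are my own, and real codecs add pre-emphasis, windowing, bandwidth expansion, etc.]

```python
import numpy as np

def lpc_coeffs(frame, order=10):
    """Levinson-Durbin recursion: solve the autocorrelation normal
    equations for LPC prediction coefficients.

    Assumes a non-silent frame (r[0] > 0). Returns a with a[0] = 1;
    the predictor is x[n] ~= -sum(a[1:] * x[n-1 .. n-order])."""
    # Autocorrelation of the frame, lags 0..order.
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this stage.
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        # Update coefficients (RHS is evaluated before assignment).
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        err *= (1 - k * k)
    return a
```

A quick sanity check is to feed the routine a synthetic AR process with known coefficients and confirm it recovers them approximately.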
Reply by ●January 14, 2005
You betcha, 870 bits/sec is low. We developed a 300 bits/sec codec and spent many months optimizing it. It produced acceptable intelligibility with acceptable quality, mostly with low-pitched males. Most commercial systems are at 2400 and up.
--
Chip Wood

"Steve Underwood" <steveu@dis.org> wrote in message news:cs049t$6fa$1@home.itg.ti.com...
> [previous message quoted; snipped]