Reply by September 8, 2005
Carlos Moreno <moreno_at_mochima_dot_com@xx.xxx> writes:

> tony.nospam@nospam.tonyRobinson.com wrote:
>
> > I've other comments, on the use of polynomials (they don't do well at
> > infinity) and other things - but perhaps best we take this offline?
>
> That's a very generous offer! I would have to add your e-mail to the
> "whitelist" in my spam filter; do I simply remove the two occurrences
> of "nospam dot" from the address that appears on the newsgroup?
Yes, just s/.nospam@nospam./@/ - I tried emailing you and so can confirm that your whitelist spam filter works as intended! Tony
Reply by Carlos Moreno September 6, 2005
tony.nospam@nospam.tonyRobinson.com wrote:

> I've other comments, on the use of polynomials (they don't do well at
> infinity) and other things - but perhaps best we take this offline?
That's a very generous offer! I would have to add your e-mail to the "whitelist" in my spam filter; do I simply remove the two occurrences of "nospam dot" from the address that appears on the newsgroup?

Cheers, Carlos
Reply by September 6, 2005
Carlos Moreno <moreno_at_mochima_dot_com@xx.xxx> writes:

> > Okay - what I was trying to get at here is that it is easy to eat bits,
> > especially if you have several different things to quantise. I've also
> > written a music coder which allocates some bits to linear prediction
> > coefficients, some to single impulses and some to VQ codebooks of the
> > remaining residual (in my case the impulses and residual were in the
> > Fourier domain, but that's not important). Whilst the impulses and
> > VQ codebooks both need bits allocated to them, they are of different
> > types so it's not obvious how many bits to allocate to each.
>
> Yes, this was at some point a big concern (all of the ideas I had to
> approach it sounded too "empiric"), but then I figured that even if I
> get a sub-optimal encoder, as long as that encoder is the same for both
> cases (with the standard technique and with the modifications I want
> to try), I should be ok... I know that that is not entirely true,
> since the effectiveness of the method may be affected by what's
> happening before...
If you take out your two tenth order polynomials and the impulse coder you have a standard residual excited linear predictor. You say that your main project is to do something new (which you haven't disclosed) to this framework. It seems you are exploring areas not in the project spec - that may be very good, but I've seen a lot of projects fail badly because other areas were explored and the main goals not met, so no strong conclusions could be reached. What's more, even if you do innovate and produce a new CELP-style framework, it'll be hard to judge the original project goals against it, as your innovations will take a while to justify in themselves. Don't get me wrong, I think you are having very good ideas, but perhaps there are too many of them all at once.
> >>http://www.mochima.com/tmp/residual.png
>
> > Right - I thought I was looking at waveforms here. I think there is
> > something wrong with the generation of the original (green) residual.
> > Unless my eyes are deceiving me, it has a large high frequency component
> > with period of a few samples.
>
> I don't see it -- perhaps you're talking about the interval after the
> first peak, approx. time index 5550 to 5560?
>
> I think it's not exactly a pure high-frequency component -- some of the
> peaks do include two samples, some don't. Being part of a bigger frame
> where *the whole* frame is being minimized, I found it reasonable to
> encounter these.
Okay, I haven't - if you like, send me the data.
> I guess this is related to your point of not liking the idea that the
> residual can be modeled by a 10th-order polynomial? Meaning perhaps
> that if the frame looks like a polynomial, then it means that the
> correlation was not properly eliminated and that would mean that the
> predictor was not optimal?
LP predictors are often non-optimal as the signal changes within the frame. Sometimes you really do get a very sharp impulse and small random noise distributed throughout the rest of the frame; more often there is a low frequency component around the impulse - if the signal changes within the frame then this is at a low frequency. I haven't seen such a narrowband high frequency component before. I've other comments, on the use of polynomials (they don't do well at infinity) and other things - but perhaps best we take this offline? Tony
Reply by Carlos Moreno September 2, 2005
Carlos Moreno wrote:

>> Have you tried test signals such as simple
>> sinusoids (which should have very low residuals - zero even if it wasn't
>> for boundary conditions)?
>
> No, I haven't. I'm going to do it right away -- it's a very good idea.
Follow-up question: what kind of amplitude for the residual would be reasonable for this?

I just tried feeding a pure sinusoid with amplitude 5000 (i.e., from -5000 to +5000), and get a residual with amplitude 10 (from -10 to +10). The frequency of the signal would be approximately 1.3 kHz (at a sampling rate of 8 kHz); I replace the signal with:

    speech[i] = 5000*sin(i);

With lower-frequency inputs, like speech[i] = 5000*sin(i/5.0), I get a residual with amplitude 5 (from -5 to 5).

With a mixture of three frequencies:

    speech[i] = 5000 * (sin(i) + sin(i/5.0) + sin(i/30.0));

the residual goes up -- different frames are different, but the highest ones go from -150 to +150. I guess this might still be ok, perhaps because the filter has to do a trade-off between minimizing the output for each of the tones?

Does the above sound reasonable?

Thanks, Carlos
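For concreteness, here are the three test inputs above as a self-contained sketch; the speech[] buffer and the fillTestSignal() wrapper are illustrative names, not the poster's actual code. Note that with i counted in samples at 8 kHz, sin(i) has a period of 2*pi ~ 6.28 samples, i.e. roughly 1.27 kHz:

    // Sketch of the three test signals described in the post above.
    // fillTestSignal() and 'which' are made-up names for illustration.
    #include <cmath>

    void fillTestSignal(double *speech, int len, int which)
    {
        for (int i = 0; i < len; ++i) {
            switch (which) {
            case 0:  // pure tone, amplitude 5000, ~1.27 kHz at 8 kHz
                speech[i] = 5000.0 * std::sin((double)i);
                break;
            case 1:  // lower-frequency tone
                speech[i] = 5000.0 * std::sin(i / 5.0);
                break;
            case 2:  // mixture of three frequencies
                speech[i] = 5000.0 * (std::sin((double)i)
                                      + std::sin(i / 5.0)
                                      + std::sin(i / 30.0));
                break;
            }
        }
    }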
Reply by Carlos Moreno September 2, 2005
Thanks Tony for your interest and your detailed reply!

>>>[...]
>>>If your final goal is VQ on the residual you are adding in a lot of
>>>other processing steps. Do you know how you are going to allocate the
>>>number of bits for each step? Do you have an idea as to what you'd like
>>>the final bit rate to be?
>>
>>No, not really. The important thing from the point of view of the
>>research project I'm working on is that I need to apply VQ with a
>>given bitrate and then try some modifications (which is the core of
>>my project -- evaluating the effectiveness of this "variation" of
>>the VQ technique) and see how the quality of the output compares
>>at the same bitrate.
>
> Okay - what I was trying to get at here is that it is easy to eat bits,
> especially if you have several different things to quantise. I've also
> written a music coder which allocates some bits to linear prediction
> coefficients, some to single impulses and some to VQ codebooks of the
> remaining residual (in my case the impulses and residual were in the
> Fourier domain, but that's not important). Whilst the impulses and
> VQ codebooks both need bits allocated to them, they are of different
> types so it's not obvious how many bits to allocate to each.
Yes, this was at some point a big concern (all of the ideas I had to approach it sounded too "empiric"), but then I figured that even if I get a sub-optimal encoder, as long as that encoder is the same for both cases (with the standard technique and with the modifications I want to try), I should be ok... I know that that is not entirely true, since the effectiveness of the method may be affected by what's happening before... [...]
>>That's why I figured it might be worth trying to model the
>>"amplitude of the noise" as a function of the position in the
>>frame. This basically leads to a best-fitting polynomial for
>>the square deviation -- that gives me the best approximation
>>of the "variance as a function of the position", and then I
>>use that when generating the pseudo-random noise.
>
> I'm still not entirely sure what both polynomials do. I'm assuming that
> one models the residual signal, and the other the variance of this.
Yes. (well, both polynomials deal with the signal *without* the peaks)
>>http://www.mochima.com/tmp/residual.png
>
> Right - I thought I was looking at waveforms here. I think there is
> something wrong with the generation of the original (green) residual.
> Unless my eyes are deceiving me, it has a large high frequency component
> with period of a few samples.
I don't see it -- perhaps you're talking about the interval after the first peak, approx. time index 5550 to 5560? I think it's not exactly a pure high-frequency component -- some of the peaks do include two samples, some don't. Being part of a bigger frame where *the whole* frame is being minimized, I found it reasonable to encounter these.
> Have you tried test signals such as simple
> sinusoids (which should have very low residuals - zero even if it wasn't
> for boundary conditions)?
No, I haven't. I'm going to do it right away -- it's a very good idea.

I guess this is related to your point of not liking the idea that the residual can be modeled by a 10th-order polynomial? Meaning perhaps that if the frame looks like a polynomial, then it means that the correlation was not properly eliminated and that would mean that the predictor was not optimal?

Keep in mind that the predictor operates with a small number of taps (I set it to 15 -- though I guess I could play with that and see how it affects the residual?). So, having the "noise" follow a non-zero curve seemed reasonable (as well as having an amplitude/variance that varies over the frame).

That's why I tried to model these with two polynomials -- perhaps 10th-order is way too much (maybe 5th- or 6th-order would suffice? But still, if the curve fits perfectly on a 5th-order polynomial, then the best-fitting 10th-order polynomial should have coefficients with values 0 above the 6th coefficient; and if all the frames behave like that, then the Vector Quantiser should get rid of that -- it should extract information from where information is valuable). That's why I didn't try too hard to optimise the polynomials, and thought it was better to err on the side of overkill in this case.
> Well that's good. I'm 100% with Rune in that I think it very important
> to get the simplest scheme going first, then add in layers of
> complexity.
Ok.
>>I'm quite puzzled, because *to the eye*, the approximation that I'm
>>getting from the parametric representation looks *infinitely better*
>>than what I'm getting from the standard VQ; however, the actual
>>sound that I get is quite the opposite: the VQ version sounds
>>infinitely better than mine...
>
> Well it's pretty amazing that a single impulse per pitch period or white
> noise can generate perfectly intelligible if not 100% natural speech
Yep!! It's been more than a year since my "Speech Communications" course, and I still can't get over that! :-)
>>>Also - have you read the literature on CELP coders?
>>
>>My advisor suggested that when I talked to him about this. From what
>>I could understand, it sounded a lot like an analysis-by-synthesis
>>Vector Quantizer, but with the codebook artificially generated?
>
> I'd definitely look it up - most books on speech/audio coding will cover
> CELP.
Ok, I'll look it up.
>>Well, I'm enjoying it a lot! Unfortunately, I can not disclose the
>>other part (the presumably important part), at least until I finish
>>writing the thesis and the results and the idea is published (I
>>mean, I don't know if I can, but I prefer not to take the risk of
>>getting in trouble with McGill ;-)).
>
> Sounds good. If you publish online then please send me a link.
I guess at least I could post a summary of what I tried (describing the technique) and what I found out. Thanks again for your comments and suggestions! Carlos
Reply by September 2, 2005
Hi Carlos,

Good to get a long reply.  In a past life I've done and supervised many
LP/VQ projects, some of which worked well, some of which didn't.

Carlos Moreno <moreno_at_mochima_dot_com@xx.xxx> writes:

> tony.nospam@nospam.tonyRobinson.com wrote:
>
> > [...]
> > If your final goal is VQ on the residual you are adding in a lot of
> > other processing steps. Do you know how you are going to allocate the
> > number of bits for each step? Do you have an idea as to what you'd like
> > the final bit rate to be?
>
> No, not really. The important thing from the point of view of the
> research project I'm working on is that I need to apply VQ with a
> given bitrate and then try some modifications (which is the core of
> my project -- evaluating the effectiveness of this "variation" of
> the VQ technique) and see how the quality of the output compares
> at the same bitrate.
Okay - what I was trying to get at here is that it is easy to eat bits, especially if you have several different things to quantise. I've also written a music coder which allocates some bits to linear prediction coefficients, some to single impulses and some to VQ codebooks of the remaining residual (in my case the impulses and residual were in the Fourier domain, but that's not important). Whilst the impulses and VQ codebooks both need bits allocated to them, they are of different types, so it's not obvious how many bits to allocate to each - and if you get it wrong then the coder isn't as good as it could have been. You also have two 10th order polynomials to include.
> > What happens if you just use the four largest values as the residual?
> > For voiced speech I'd expect reasonable output.
>
> Interesting variation. In fact, when I feed the reconstruction
> filter with the peaks added to the best-fitting polynomial (which
> is in essence sort of an "ultra low-pass-filtered" version, an
> ultra-smoothed version), then I get "reasonable" output (i.e.,
> perfectly intelligible, but with an artificial quality, quite
> uneasy to the ear)
I was suggesting making things really simple and only using four peaks - I don't think your 10th order polynomial should be needed - more on that later.
> > Are there two 10th order polynomials involved here? If so, what is the
> > first doing? Have you tried "just finding the variance" - if not, then
> > it's always best to do the simple things first.
>
> Yes, I had originally tried with just the variance; when seeing
> that it didn't work nicely at all, I then observed that the apparent
> amplitude of the noise was not constant. In particular, near the
> peaks, there seems to consistently be more activity;
>
> That's why I figured it might be worth trying to model the
> "amplitude of the noise" as a function of the position in the
> frame. This basically leads to a best-fitting polynomial for
> the square deviation -- that gives me the best approximation
> of the "variance as a function of the position", and then I
> use that when generating the pseudo-random noise.
I'm still not entirely sure what both polynomials do. I'm assuming that one models the residual signal, and the other the variance of this. I like the idea that the variance of the residual fluctuates over a frame of voiced speech - intuitively it should be greater when there is glottal excitation and less in the closed period. Off the top of my head I can't think of codecs that exploit this - perhaps others in these newsgroups could contribute references - if they exist. I don't like the idea that your residual itself can be modelled by a tenth order polynomial - more on that...
> > This sounds like you have two polynomials - I'd be surprised if the mean
> > of the residuals is far from zero - why not post plots of the residual?
>
> I did. In the link I posted, the green signal is the actual
> residual, and the red one is the reconstructed one:
>
> http://www.mochima.com/tmp/residual.png
Right - I thought I was looking at waveforms here. I think there is something wrong with the generation of the original (green) residual. Unless my eyes are deceiving me, it has a large high frequency component with period of a few samples. The whole idea of the linear prediction is to remove correlations over this timescale, so I would guess that this hasn't operated correctly. Have you tried test signals such as simple sinusoids (which should have very low residuals - zero even if it wasn't for boundary conditions)?
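For reference, the reason a pure sinusoid should be almost perfectly predictable: any x[n] = A*sin(w*n + phi) satisfies x[n] = 2*cos(w)*x[n-1] - x[n-2] exactly, so a predictor of order two or more can cancel it away from the frame boundaries. A minimal self-contained check of that identity (illustrative code, not from either poster):

    #include <algorithm>
    #include <cmath>
    #include <cstdio>

    int main()
    {
        const double w = 1.0, A = 5000.0;  // matches speech[i] = 5000*sin(i)
        double maxErr = 0.0;
        for (int n = 2; n < 1000; ++n) {
            double x0 = A * std::sin(w * n);
            double x1 = A * std::sin(w * (n - 1));
            double x2 = A * std::sin(w * (n - 2));
            // exact 2nd-order predictor for any sinusoid
            maxErr = std::max(maxErr,
                              std::fabs(x0 - (2.0 * std::cos(w) * x1 - x2)));
        }
        std::printf("max prediction error: %g\n", maxErr);
        return 0;
    }

maxErr should come out at round-off level, which is why a residual much larger than that on a pure tone points at the analysis code rather than the signal.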
> > Good. However I still think you are trying to do too much at once. For
> > clean speech you can get away with a pulse train for voiced excitation
> > and uniform random noise for unvoiced and it doesn't sound too bad -
> > better than what you are describing.
>
> Hmmm... I wonder if we're having different ideas/thresholds in our
> descriptions of "sounds ok" or "sounds horrible"... Perhaps I should
> have posted samples of the sounds? :-)
Seems like you have access to a web site where you could.
> As I said in my reply to Rune, when trying the "naive" VQ approach
> directly taking the entire residual as a 128-dimensional vector (well,
> 64-dimensional, as I tried reducing the frame size to 64), even though
> the "transcoded" residual looks extremely different, the quality of
> the reconstructed audio is astonishing. And I mean *really* better,
> as in comparing a badly-tuned AM radio to a regular-quality FM.
Well that's good. I'm 100% with Rune in that I think it very important to get the simplest scheme going first, then add in layers of complexity.
> I'm quite puzzled, because *to the eye*, the approximation that I'm
> getting from the parametric representation looks *infinitely better*
> than what I'm getting from the standard VQ; however, the actual
> sound that I get is quite the opposite: the VQ version sounds
> infinitely better than mine...
Well it's pretty amazing that a single impulse per pitch period or white noise can generate perfectly intelligible if not 100% natural speech (given clean input). In this case the synthetic excitation looks nothing like the real thing, but the spectral properties are preserved and that seems to be what is important. Indeed, all of speech recognition is built on this: only the spectral envelope is used and all of the excitation signal is thrown away.
> > Also - have you read the literature on CELP coders?
>
> My advisor suggested that when I talked to him about this. From what
> I could understand, it sounded a lot like an analysis-by-synthesis
> Vector Quantizer, but with the codebook artificially generated?
I'd definitely look it up - most books on speech/audio coding will cover CELP.
> > I like your project - I hope you enjoy it.
>
> Well, I'm enjoying it a lot! Unfortunately, I can not disclose the
> other part (the presumably important part), at least until I finish
> writing the thesis and the results and the idea is published (I
> mean, I don't know if I can, but I prefer not to take the risk of
> getting in trouble with McGill ;-)).
Sounds good. If you publish online then please send me a link. Tony
Reply by Carlos Moreno September 1, 2005
tony.nospam@nospam.tonyRobinson.com wrote:
> [ crossposted to comp.speech.research as you'll find a lot of
> knowledgeable people there]
Thanks! That's a good idea!
> [...]
>
> If your final goal is VQ on the residual you are adding in a lot of
> other processing steps. Do you know how you are going to allocate the
> number of bits for each step? Do you have an idea as to what you'd like
> the final bit rate to be?
No, not really. The important thing from the point of view of the research project I'm working on is that I need to apply VQ with a given bitrate and then try some modifications (which is the core of my project -- evaluating the effectiveness of this "variation" of the VQ technique) and see how the quality of the output compares at the same bitrate. The encoding I was trying is not the core of my project -- it was supposed to be an auxiliary trick I was hoping would make the VQ technique simpler and more efficient... (it doesn't seem to be the case :-( )
> What happens if you just use the four largest values as the residual?
> For voiced speech I'd expect reasonable output.
Interesting variation. In fact, when I feed the reconstruction filter with the peaks added to the best-fitting polynomial (which is in essence sort of an "ultra low-pass-filtered" version, an ultra-smoothed version), then I get "reasonable" output (i.e., perfectly intelligible, but with an artificial quality, quite uneasy to the ear)
> Are there two 10th order polynomials involved here? If so, what is the
> first doing? Have you tried "just finding the variance" - if not, then
> it's always best to do the simple things first.
Yes, I had originally tried with just the variance; when seeing that it didn't work nicely at all, I then observed that the apparent amplitude of the noise was not constant. In particular, near the peaks, there seems to consistently be more activity.

That's why I figured it might be worth trying to model the "amplitude of the noise" as a function of the position in the frame. This basically leads to a best-fitting polynomial for the square deviation -- that gives me the best approximation of the "variance as a function of the position", and then I use that when generating the pseudo-random noise.
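A minimal sketch of that two-polynomial scheme as described (the normal-equation least-squares solver, the scaling of position to t in [0,1], and all names are illustrative choices, not the poster's actual code):

    #include <algorithm>
    #include <cmath>
    #include <random>
    #include <vector>

    // Least-squares fit of y[n] ~ sum_k c[k] * t^k, with t = n/(N-1) in
    // [0,1] to keep the normal equations reasonably conditioned. Solved
    // by plain Gaussian elimination (no pivoting -- fine for a sketch).
    std::vector<double> polyFit(const std::vector<double>& y, int deg)
    {
        const int N = (int)y.size(), M = deg + 1;
        std::vector<std::vector<double>> A(M, std::vector<double>(M + 1, 0.0));
        for (int n = 0; n < N; ++n) {
            double t = (N > 1) ? double(n) / (N - 1) : 0.0;
            std::vector<double> p(M);
            p[0] = 1.0;
            for (int k = 1; k < M; ++k) p[k] = p[k - 1] * t;
            for (int i = 0; i < M; ++i) {
                for (int j = 0; j < M; ++j) A[i][j] += p[i] * p[j];
                A[i][M] += p[i] * y[n];   // right-hand side
            }
        }
        for (int i = 0; i < M; ++i)
            for (int r = i + 1; r < M; ++r) {
                double f = A[r][i] / A[i][i];
                for (int c = i; c <= M; ++c) A[r][c] -= f * A[i][c];
            }
        std::vector<double> coef(M);
        for (int i = M - 1; i >= 0; --i) {
            double s = A[i][M];
            for (int j = i + 1; j < M; ++j) s -= A[i][j] * coef[j];
            coef[i] = s / A[i][i];
        }
        return coef;
    }

    double polyEval(const std::vector<double>& c, double t)
    {
        double v = 0.0;
        for (int k = (int)c.size() - 1; k >= 0; --k) v = v * t + c[k];
        return v;
    }

    // Reconstruction: mean polynomial plus Gaussian noise whose variance
    // comes from the second polynomial (clamped at zero, since a raw
    // least-squares fit of squared values can dip negative).
    std::vector<double> reconstructNoise(const std::vector<double>& meanCoef,
                                         const std::vector<double>& varCoef,
                                         int N)
    {
        static std::mt19937 rng(12345);
        std::normal_distribution<double> gauss(0.0, 1.0);
        std::vector<double> out(N);
        for (int n = 0; n < N; ++n) {
            double t = (N > 1) ? double(n) / (N - 1) : 0.0;
            double var = std::max(0.0, polyEval(varCoef, t));
            out[n] = polyEval(meanCoef, t) + std::sqrt(var) * gauss(rng);
        }
        return out;
    }

Usage in this scheme would presumably be: fit meanCoef to the de-peaked residual, subtract it, fit varCoef to the squared leftovers, and call reconstructNoise() at the decoder.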
> This sounds like you have two polynomials - I'd be surprised if the mean
> of the residuals is far from zero - why not post plots of the residual?
I did. In the link I posted, the green signal is the actual residual, and the red one is the reconstructed one: http://www.mochima.com/tmp/residual.png (I hope I'm understanding correctly your question)
> Good. However I still think you are trying to do too much at once. For
> clean speech you can get away with a pulse train for voiced excitation
> and uniform random noise for unvoiced and it doesn't sound too bad -
> better than what you are describing.
Hmmm... I wonder if we're having different ideas/thresholds in our descriptions of "sounds ok" or "sounds horrible"... Perhaps I should have posted samples of the sounds? :-)

As I said in my reply to Rune, when trying the "naive" VQ approach directly taking the entire residual as a 128-dimensional vector (well, 64-dimensional, as I tried reducing the frame size to 64), even though the "transcoded" residual looks extremely different, the quality of the reconstructed audio is astonishing. And I mean *really* better, as in comparing a badly-tuned AM radio to a regular-quality FM.

I'm quite puzzled, because *to the eye*, the approximation that I'm getting from the parametric representation looks *infinitely better* than what I'm getting from the standard VQ; however, the actual sound that I get is quite the opposite: the VQ version sounds infinitely better than mine...
> Also - have you read the literature on CELP coders?
My advisor suggested that when I talked to him about this. From what I could understand, it sounded a lot like an analysis-by-synthesis Vector Quantizer, but with the codebook artificially generated?
> I like your project - I hope you enjoy it.
Well, I'm enjoying it a lot! Unfortunately, I can not disclose the other part (the presumably important part), at least until I finish writing the thesis and the results and the idea is published (I mean, I don't know if I can, but I prefer not to take the risk of getting in trouble with McGill ;-)). Thanks for your comments! Your message is highly appreciated! Carlos
Reply by Carlos Moreno September 1, 2005
Rune Allnor wrote:

>>Any comments? I know it was a quite longish post, but I'm hoping
>>some kind soul will have the patience to go through it and share
>>some thoughts.
>
> I wrote a somewhat elaborate reply to this post, but google
> complained about my PC when I tried to post it.
:-(((( (the sad face is not because I missed the detailed reply, but because I feel bad that you took the time and the effort to reply, and stupid Google (or stupid Windows, or stupid browser) just threw it away :-( )
> You introduce lots of elaborate steps, compared to the "naive"
> LPC encoder. Why not start with just the white noise excitation
> signal for the reconstruction filter, and add your additional
> steps one by one? That way you might be able to isolate one
> or two operations that produce bad effects in the reconstructed
> data.
I've tried that as part of my debugging steps.

With just white noise, it sounds intelligible, but quite scratchy (like someone with a *really* bad cold, or rather, a bad case of laryngitis or something like that).

With just the pulses and the "best fitting polynomial" for the mean (i.e., the best-fitting polynomial for the frame), it sounds more or less ok, but it gets an "artificial" quality; like a "mechanical" synthesized voice (no, not completely robot-like from the 50s and 60s movies -- that happens when I feed it with a train of pulses at constant frequency :-)).

When I put the whole thing together, well, it's still *very* intelligible, but it just sounds extremely different and with artifacts.

I've kept working on it and discovered that, while there were no bugs in the reconstruction filtering part, there was a bug that made me get the wrong value for the peaks under some circumstances. But it was minor -- it was once every several thousand frames that I would get the wrong value for the peaks. I fixed it, but it didn't really change much.

Now, something really surprises me. I temporarily put aside the approach of parameterizing the residual before VQ'ing it, and tried the "naive" approach of VQ. I reduced the frame size to 64 samples, just used the square error as a distance criterion, and used straight average (i.e., sample-by-sample average) to compute the mean values (the centroids of the clusters) for the VQ codebook.

The results were astonishing (to me) in two senses: 1) the quality of the output (the transcoded speech) was really good. And 2) I was astonished to see how different the VQ'ed residuals were from the actual residuals. The only detail *more or less* consistent was the peaks and the values around the peaks. For the rest, it was horribly different.

That makes me wonder if perhaps the fact that I'm missing the values and the high correlation between the peaks and the few samples following the peak is what's making the huge difference with my approach?

Anyway, I guess I'll keep working on the "standard" VQ approach, since I have several other things to test at this point (this encoding thing was supposed to make it simpler for me to do the VQ, but it doesn't seem to be the case so far :-( ).

Thanks! Carlos
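A minimal sketch of that "naive" VQ step - squared-error nearest-neighbour search plus sample-by-sample centroid update. One assignment/update pass is shown; actual LBG/k-means training iterates this to convergence (and typically splits codewords to grow the codebook). All names are illustrative:

    #include <limits>
    #include <vector>

    using Vec = std::vector<double>;

    // Squared-error nearest-neighbour search over the codebook.
    int nearest(const std::vector<Vec>& codebook, const Vec& v)
    {
        int best = 0;
        double bestDist = std::numeric_limits<double>::max();
        for (int c = 0; c < (int)codebook.size(); ++c) {
            double d = 0.0;
            for (int i = 0; i < (int)v.size(); ++i) {
                double diff = v[i] - codebook[c][i];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }

    // One pass: assign each training vector to its nearest codeword, then
    // move each codeword to the straight (sample-by-sample) mean of its
    // cluster, exactly as described in the post above.
    void updateCodebook(std::vector<Vec>& codebook,
                        const std::vector<Vec>& train)
    {
        const int dim = (int)train[0].size();
        std::vector<Vec> sum(codebook.size(), Vec(dim, 0.0));
        std::vector<int> count(codebook.size(), 0);
        for (const Vec& v : train) {
            int c = nearest(codebook, v);
            for (int i = 0; i < dim; ++i) sum[c][i] += v[i];
            ++count[c];
        }
        for (int c = 0; c < (int)codebook.size(); ++c)
            if (count[c] > 0)
                for (int i = 0; i < dim; ++i)
                    codebook[c][i] = sum[c][i] / count[c];
    }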
Reply by September 1, 2005
[ crossposted to comp.speech.research as you'll find a lot of
knowledgeable people there]

Carlos Moreno <moreno_at_mochima_dot_com@xx.xxx> writes:

> As part of my research project, I'm working on vector-quantization
> applied to an LPC-like encoding scheme.
>
> For each frame of speech (I tentatively put a frame length of 128
> samples -- at 8kHz sampling rate), I compute the optimal prediction
> filter (using the autocorrelation method), and then apply that filter
> to the signal (well, I apply 1-P to the signal, where P(z) is the
> optimal prediction filter).
...
> So, I end up with the residual, and I want to encode this. The
> final goal is to do VQ on that residual.
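A minimal sketch of the analysis step just quoted - autocorrelation, Levinson-Durbin recursion for P(z), then inverse filtering by 1 - P(z); function and variable names are illustrative, not the poster's code:

    #include <vector>

    std::vector<double> lpcResidual(const std::vector<double>& frame, int order)
    {
        const int N = (int)frame.size();

        // Autocorrelation r[0..order]
        std::vector<double> r(order + 1, 0.0);
        for (int k = 0; k <= order; ++k)
            for (int n = k; n < N; ++n)
                r[k] += frame[n] * frame[n - k];

        // Levinson-Durbin: predictor coefficients a[1..order]
        std::vector<double> a(order + 1, 0.0);
        double err = r[0];
        for (int i = 1; i <= order; ++i) {
            double acc = r[i];
            for (int j = 1; j < i; ++j) acc -= a[j] * r[i - j];
            double k = acc / err;           // no guard against err == 0
            std::vector<double> prev(a);
            a[i] = k;
            for (int j = 1; j < i; ++j) a[j] = prev[j] - k * prev[i - j];
            err *= 1.0 - k * k;
        }

        // Residual e[n] = x[n] - sum_j a[j] * x[n-j], i.e. filter by 1 - P(z)
        std::vector<double> e(N);
        for (int n = 0; n < N; ++n) {
            double pred = 0.0;
            for (int j = 1; j <= order && j <= n; ++j)
                pred += a[j] * frame[n - j];
            e[n] = frame[n] - pred;
        }
        return e;
    }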
If your final goal is VQ on the residual you are adding in a lot of other processing steps. Do you know how you are going to allocate the number of bits for each step? Do you have an idea as to what you'd like the final bit rate to be?
> I notice that the residual consists of a few peaks plus noise,
> so I figured: I'll encode the amplitude and positions of up to
> 4 peaks (most frames have 1 peak, many have 2 or 3, and many of
> them have no peaks at all), and then, for the rest, I'll determine
> the best-fitting 10th-order polynomial to represent the data.
What happens if you just use the four largest values as the residual? For voiced speech I'd expect reasonable output. What happens if you use more values? In the limit of 128 it should work perfectly, as you'll have the unquantised residual.
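A minimal sketch of that baseline - keep the n largest-magnitude samples and zero the rest (illustrative names; residual is assumed to be one analysis frame and n at most the frame length):

    #include <algorithm>
    #include <cmath>
    #include <numeric>
    #include <vector>

    std::vector<double> keepLargest(const std::vector<double>& residual, int n)
    {
        std::vector<int> idx(residual.size());
        std::iota(idx.begin(), idx.end(), 0);
        // Order indices by descending magnitude; only the first n matter.
        std::partial_sort(idx.begin(), idx.begin() + n, idx.end(),
            [&](int a, int b) {
                return std::abs(residual[a]) > std::abs(residual[b]);
            });
        std::vector<double> out(residual.size(), 0.0);
        for (int i = 0; i < n; ++i)
            out[idx[i]] = residual[idx[i]];
        return out;   // feed this to the reconstruction filter
    }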
> Then, the rest should be almost pure "whitish" noise, only that
> the amplitude of that noise varies over the frame. So, what I
> do is that instead of just finding the variance (a constant,
> average value for the entire frame), I compute the best-fitting
> 10th-order polynomial for the square values of the remaining
> (0-mean) signal.
Are there two 10th order polynomials involved here? If so, what is the first doing? Have you tried "just finding the variance" - if not, then it's always best to do the simple things first.
> To reconstruct, I just add the mean (the value of the best-fitting
> polynomial for the data) with a pseudo-random Gaussian number,
> with variance given by the polynomial that gives me the "amplitude"
> of the noise as a function of the position in the frame.
This sounds like you have two polynomials - I'd be surprised if the mean of the residuals is far from zero - why not post plots of the residual?
> No, it is not a bug in the implementation (I'm actually quite
> positively certain of that -- really). If I keep the same frame
> and pass it through the reconstruction filter, I obtain the exact
> same audio stream (sample-by-sample the exact same values).
Good. However I still think you are trying to do too much at once. For clean speech you can get away with a pulse train for voiced excitation and uniform random noise for unvoiced and it doesn't sound too bad - better than what you are describing. I'd pick the n biggest samples, encode those exactly and model the rest with uniform random noise. Once that sounds okay you'll have a baseline to compare to for your ideas with 10th order polynomials.

Or think about it another way: if I were an external examiner on your research project I'd be asking what baseline systems you developed from, and expect a good answer.

Also - have you read the literature on CELP coders?

I like your project - I hope you enjoy it. Tony
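As an illustration of that pulse-train-plus-noise baseline, a minimal sketch (the voiced/unvoiced decision, pitch period and gain are assumed to come from elsewhere; all names are made up):

    #include <random>
    #include <vector>

    // Baseline excitation: impulse train at the pitch period for voiced
    // frames, uniform random noise for unvoiced ones. pitchPeriod must
    // be positive; the result is fed to the LP synthesis filter.
    std::vector<double> excitation(int N, bool voiced, int pitchPeriod,
                                   double gain)
    {
        std::vector<double> e(N, 0.0);
        if (voiced) {
            for (int n = 0; n < N; n += pitchPeriod)
                e[n] = gain;
        } else {
            static std::mt19937 rng(1);
            std::uniform_real_distribution<double> u(-gain, gain);
            for (int n = 0; n < N; ++n)
                e[n] = u(rng);
        }
        return e;
    }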
Reply by Rune Allnor September 1, 2005
Carlos Moreno wrote:

> Any comments? I know it was a quite longish post, but I'm hoping
> some kind soul will have the patience to go through it and share
> some thoughts.
I wrote a somewhat elaborate reply to this post, but google complained about my PC when I tried to post it. If the original post does not appear, here is the short version:

You introduce lots of elaborate steps, compared to the "naive" LPC encoder. Why not start with just the white noise excitation signal for the reconstruction filter, and add your additional steps one by one? That way you might be able to isolate one or two operations that produce bad effects in the reconstructed data.

Rune