Reply by September 8, 2005
Carlos Moreno <moreno_at_mochima_dot_com@xx.xxx> writes:

> tony.nospam@nospam.tonyRobinson.com wrote:
>
> > I've other comments, on the use of polynomials (they don't do well at
> > infinity) and other things - but perhaps best we take this offline?
>
> That's a very generous offer! I would have to add your e-mail to the
> "whitelist" in my spam filter; do I simply remove the two occurrences
> of "nospam dot" from the address that appears on the newsgroup?
Yes, just s/.nospam@nospam./@/ - I tried emailing you and so can confirm that your whitelist spam filter works as intended! Tony
Reply by Carlos Moreno September 6, 2005
tony.nospam@nospam.tonyRobinson.com wrote:

> I've other comments, on the use of polynomials (they don't do well at
> infinity) and other things - but perhaps best we take this offline?
That's a very generous offer! I would have to add your e-mail to the "whitelist" in my spam filter; do I simply remove the two occurrences of "nospam dot" from the address that appears on the newsgroup?

Cheers, Carlos
Reply by September 6, 2005
Carlos Moreno <moreno_at_mochima_dot_com@xx.xxx> writes:

> > Okay - what I was trying to get at here is that it is easy to eat bits,
> > especially if you have several different things to quantise. I've also
> > written a music coder which allocates some bits to linear prediction
> > coefficients, some to single impulses and some to VQ codebooks of the
> > remaining residual (in my case the impulses and residual were in the
> > Fourier domain, but that's not important). Whilst the impulses and
> > VQ codebooks both need bits allocated to them, they are of different
> > types so it's not obvious how many bits to allocate to each.
>
> Yes, this was at some point a big concern (all of the ideas I had to
> approach it sounded too "empiric"), but then I figured that even if I
> get a sub-optimal encoder, as long as that encoder is the same for both
> cases (with the standard technique and with the modifications I want
> to try), I should be ok... I know that that is not entirely true,
> since the effectiveness of the method may be affected by what's
> happening before...
If you take out your two tenth order polynomials and the impulse coder you have a standard residual excited linear predictor. You say that your main project is to do something new (which you haven't disclosed) to this framework. It seems you are exploring areas not in the project spec - that may be very good, but I've seen a lot of projects fail badly because other areas were explored and the main goals not met, so no strong conclusions could be reached. What's more, even if you do innovate and produce a new CELP-style framework, it'll be hard to judge the original project goals against it, as your innovations will take a while to justify in themselves. Don't get me wrong, I think you are having very good ideas, but perhaps there are too many of them all at once.
> >>http://www.mochima.com/tmp/residual.png
>
> > Right - I thought I was looking at waveforms here. I think there is
> > something wrong with the generation of the original (green) residual.
> > Unless my eyes are deceiving me, it has a large high frequency component
> > with period of a few samples.
>
> I don't see it -- perhaps you're talking about the interval after the
> first peak, approx. time index 5550 to 5560?
>
> I think it's not exactly a pure high-frequency component -- some of the
> peaks do include two samples, some don't. Being part of a bigger frame
> where *the whole* frame is being minimized, I found it reasonable to
> encounter these.
Okay, I haven't - if you like, send me the data.
> I guess this is related to your point of not liking the idea that the
> residual can be modeled by a 10th-order polynomial? Meaning perhaps
> that if the frame looks like a polynomial, then it means that the
> correlation was not properly eliminated and that would mean that the
> predictor was not optimal?
LP predictors are often non-optimal as the signal changes within the frame. Sometimes you really do get a very sharp impulse and small random noise distributed throughout the rest of the frame; more often there is a low frequency component around the impulse - if the signal changes within the frame then this is at a low frequency. I haven't seen such a narrowband high frequency component before. I've other comments, on the use of polynomials (they don't do well at infinity) and other things - but perhaps best we take this offline? Tony
Reply by Carlos Moreno September 2, 2005
Carlos Moreno wrote:

>> Have you tried test signals such as simple
>> sinusoids (which should have very low residuals - zero even if it wasn't
>> for boundary conditions)?
>
> No, I haven't. I'm going to do it right away -- it's a very good idea.
Follow-up question: what kind of amplitude for the residual would be reasonable for this?

I just tried feeding a pure sinusoid with amplitude 5000 (i.e., from -5000 to +5000), and get a residual with amplitude 10 (from -10 to +10). The frequency of the signal would be approximately 1.3 kHz (at a sampling rate of 8 kHz); I replace the signal with:

    speech[i] = 5000*sin(i);

With lower-frequency inputs, like speech[i] = 5000*sin(i/5.0), I get a residual with amplitude 5 (from -5 to 5).

With a mixture of three frequencies:

    speech[i] = 5000 * (sin(i) + sin(i/5.0) + sin(i/30.0));

the residual goes up -- different frames are different, but the highest ones go from -150 to +150. I guess this might still be ok, perhaps because the filter has to do a trade-off between minimizing the output for each of the tones?

Does the above sound reasonable?

Thanks, Carlos
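For concreteness, here are the three test inputs above as a self-contained sketch; the speech[] buffer and the fillTestSignal() wrapper are illustrative names, not the poster's actual code. Note that with i counted in samples at 8 kHz, sin(i) has a period of 2*pi ~ 6.28 samples, i.e. roughly 1.27 kHz:

    // Sketch of the three test signals described in the post above.
    // fillTestSignal() and 'which' are made-up names for illustration.
    #include <cmath>

    void fillTestSignal(double *speech, int len, int which)
    {
        for (int i = 0; i < len; ++i) {
            switch (which) {
            case 0:  // pure tone, amplitude 5000, ~1.27 kHz at 8 kHz
                speech[i] = 5000.0 * std::sin((double)i);
                break;
            case 1:  // lower-frequency tone
                speech[i] = 5000.0 * std::sin(i / 5.0);
                break;
            case 2:  // mixture of three frequencies
                speech[i] = 5000.0 * (std::sin((double)i)
                                      + std::sin(i / 5.0)
                                      + std::sin(i / 30.0));
                break;
            }
        }
    }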
Reply by Carlos Moreno September 2, 2005
Thanks Tony for your interest and your detailed reply!

>>>[...]
>>>If your final goal is VQ on the residual you are adding in a lot of
>>>other processing steps. Do you know how you are going to allocate the
>>>number of bits for each step? Do you have an idea as to what you'd like
>>>the final bit rate to be?
>>
>>No, not really. The important thing from the point of view of the
>>research project I'm working on is that I need to apply VQ with a
>>given bitrate and then try some modifications (which is the core of
>>my project -- evaluating the effectiveness of this "variation" of
>>the VQ technique) and see how the quality of the output compares
>>at the same bitrate.
>
> Okay - what I was trying to get at here is that it is easy to eat bits,
> especially if you have several different things to quantise. I've also
> written a music coder which allocates some bits to linear prediction
> coefficients, some to single impulses and some to VQ codebooks of the
> remaining residual (in my case the impulses and residual were in the
> Fourier domain, but that's not important). Whilst the impulses and
> VQ codebooks both need bits allocated to them, they are of different
> types so it's not obvious how many bits to allocate to each.
Yes, this was at some point a big concern (all of the ideas I had to approach it sounded too "empiric"), but then I figured that even if I get a sub-optimal encoder, as long as that encoder is the same for both cases (with the standard technique and with the modifications I want to try), I should be ok... I know that that is not entirely true, since the effectiveness of the method may be affected by what's happening before... [...]
>>That's why I figured it might be worth trying to model the
>>"amplitude of the noise" as a function of the position in the
>>frame. This basically leads to a best-fitting polynomial for
>>the square deviation -- that gives me the best approximation
>>of the "variance as a function of the position", and then I
>>use that when generating the pseudo-random noise.
>
> I'm still not entirely sure what both polynomials do. I'm assuming that
> one models the residual signal, and the other the variance of this.
Yes. (well, both polynomials deal with the signal *without* the peaks)
>>http://www.mochima.com/tmp/residual.png
>
> Right - I thought I was looking at waveforms here. I think there is
> something wrong with the generation of the original (green) residual.
> Unless my eyes are deceiving me, it has a large high frequency component
> with period of a few samples.
I don't see it -- perhaps you're talking about the interval after the first peak, approx. time index 5550 to 5560? I think it's not exactly a pure high-frequency component -- some of the peaks do include two samples, some don't. Being part of a bigger frame where *the whole* frame is being minimized, I found it reasonable to encounter these.
> Have you tried test signals such as simple
> sinusoids (which should have very low residuals - zero even if it wasn't
> for boundary conditions)?
No, I haven't. I'm going to do it right away -- it's a very good idea.

I guess this is related to your point of not liking the idea that the residual can be modeled by a 10th-order polynomial? Meaning perhaps that if the frame looks like a polynomial, then it means that the correlation was not properly eliminated and that would mean that the predictor was not optimal?

Keep in mind that the predictor operates with a small number of taps (I set it to 15 -- though I guess I could play with that and see how it affects the residual?). So, having the "noise" follow a non-zero curve seemed reasonable (as well as having an amplitude/variance that varies over the frame).

That's why I tried to model these with two polynomials -- perhaps 10th-order is way too much (maybe 5th- or 6th-order would suffice? But still, if the curve fits perfectly on a 5th-order polynomial, then the best-fitting 10th-order polynomial should have coefficients with values 0 above the 6th coefficient; and if all the frames behave like that, then the Vector Quantiser should get rid of that -- it should extract information from where information is valuable). That's why I didn't try too hard to optimise the polynomials, and thought it was better to err on the side of overkill in this case.
> Well that's good. I'm 100% with Rune in that I think it very important
> to get the simplest scheme going first, then add in layers of
> complexity.
Ok.
>>I'm quite puzzled, because *to the eye*, the approximation that I'm
>>getting from the parametric representation looks *infinitely better*
>>than what I'm getting from the standard VQ; however, the actual
>>sound that I get is quite the opposite: the VQ version sounds
>>infinitely better than mine...
>
> Well it's pretty amazing that a single impulse per pitch period or white
> noise can generate perfectly intelligible if not 100% natural speech
Yep!! It's been more than a year since my "Speech Communications" course, and I still can't get over that! :-)
>>>Also - have you read the literature on CELP coders?
>>
>>My advisor suggested that when I talked to him about this. From what
>>I could understand, it sounded a lot like an analysis-by-synthesis
>>Vector Quantizer, but with the codebook artificially generated?
>
> I'd definitely look it up - most books on speech/audio coding will cover
> CELP.
Ok, I'll look it up.
>>Well, I'm enjoying it a lot! Unfortunately, I can not disclose the
>>other part (the presumably important part), at least until I finish
>>writing the thesis and the results and the idea is published (I
>>mean, I don't know if I can, but I prefer not to take the risk of
>>getting in trouble with McGill ;-)).
>
> Sounds good. If you publish online then please send me a link.
I guess at least I could post a summary of what I tried (describing the technique) and what I found out. Thanks again for your comments and suggestions! Carlos
Reply by September 2, 2005
Hi Carlos,

Good to get a long reply.  In a past life I've done and supervised many
LP/VQ projects, some of which worked well, some of which didn't.

Carlos Moreno <moreno_at_mochima_dot_com@xx.xxx> writes:

> tony.nospam@nospam.tonyRobinson.com wrote:
>
> > [...]
> > If your final goal is VQ on the residual you are adding in a lot of
> > other processing steps. Do you know how you are going to allocate the
> > number of bits for each step? Do you have an idea as to what you'd like
> > the final bit rate to be?
>
> No, not really. The important thing from the point of view of the
> research project I'm working on is that I need to apply VQ with a
> given bitrate and then try some modifications (which is the core of
> my project -- evaluating the effectiveness of this "variation" of
> the VQ technique) and see how the quality of the output compares
> at the same bitrate.
Okay - what I was trying to get at here is that it is easy to eat bits, especially if you have several different things to quantise. I've also written a music coder which allocates some bits to linear prediction coefficients, some to single impulses and some to VQ codebooks of the remaining residual (in my case the impulses and residual were in the Fourier domain, but that's not important). Whilst the impulses and VQ codebooks both need bits allocated to them, they are of different types, so it's not obvious how many bits to allocate to each - and if you get it wrong then the coder isn't as good as it could have been. You also have two 10th order polynomials to include.
> > What happens if you just use the four largest values as the residual?
> > For voiced speech I'd expect reasonable output.
>
> Interesting variation. In fact, when I feed the reconstruction
> filter with the peaks added to the best-fitting polynomial (which
> is in essence sort of an "ultra low-pass-filtered" version, an
> ultra-smoothed version), then I get "reasonable" output (i.e.,
> perfectly intelligible, but with an artificial quality, quite
> uneasy to the ear)
I was suggesting making things really simple and only using four peaks - I don't think your 10th order polynomial should be needed - more on that later.
> > Are there two 10th order polynomials involved here? If so, what is the
> > first doing? Have you tried "just finding the variance" - if not, then
> > it's always best to do the simple things first.
>
> Yes, I had originally tried with just the variance; when seeing
> that it didn't work nicely at all, I then observed that the apparent
> amplitude of the noise was not constant. In particular, near the
> peaks, there seems to consistently be more activity;
>
> That's why I figured it might be worth trying to model the
> "amplitude of the noise" as a function of the position in the
> frame. This basically leads to a best-fitting polynomial for
> the square deviation -- that gives me the best approximation
> of the "variance as a function of the position", and then I
> use that when generating the pseudo-random noise.
I'm still not entirely sure what both polynomials do. I'm assuming that one models the residual signal, and the other the variance of this. I like the idea that the variance of the residual fluctuates over a frame of voiced speech - intuitively it should be greater when there is glottal excitation and less in the closed period. Off the top of my head I can't think of codecs that exploit this - perhaps others in these newsgroups could contribute references - if they exist. I don't like the idea that your residual itself can be modelled by a tenth order polynomial - more on that...
> > This sounds like you have two polynomials - I'd be surprised if the mean
> > of the residuals is far from zero - why not post plots of the residual?
>
> I did. In the link I posted, the green signal is the actual
> residual, and the red one is the reconstructed one:
>
> http://www.mochima.com/tmp/residual.png
Right - I thought I was looking at waveforms here. I think there is something wrong with the generation of the original (green) residual. Unless my eyes are deceiving me, it has a large high frequency component with period of a few samples. The whole idea of the linear prediction is to remove correlations over this timescale, so I would guess that this hasn't operated correctly. Have you tried test signals such as simple sinusoids (which should have very low residuals - zero even if it wasn't for boundary conditions)?
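For reference, the reason a pure sinusoid should be almost perfectly predictable: any x[n] = A*sin(w*n + phi) satisfies x[n] = 2*cos(w)*x[n-1] - x[n-2] exactly, so a predictor of order two or more can cancel it away from the frame boundaries. A minimal self-contained check of that identity (illustrative code, not from either poster):

    #include <algorithm>
    #include <cmath>
    #include <cstdio>

    int main()
    {
        const double w = 1.0, A = 5000.0;  // matches speech[i] = 5000*sin(i)
        double maxErr = 0.0;
        for (int n = 2; n < 1000; ++n) {
            double x0 = A * std::sin(w * n);
            double x1 = A * std::sin(w * (n - 1));
            double x2 = A * std::sin(w * (n - 2));
            // exact 2nd-order predictor for any sinusoid
            maxErr = std::max(maxErr,
                              std::fabs(x0 - (2.0 * std::cos(w) * x1 - x2)));
        }
        std::printf("max prediction error: %g\n", maxErr);
        return 0;
    }

maxErr should come out at round-off level, which is why a residual much larger than that on a pure tone points at the analysis code rather than the signal.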
> > Good. However I still think you are trying to do too much at once. For
> > clean speech you can get away with a pulse train for voiced excitation
> > and uniform random noise for unvoiced and it doesn't sound too bad -
> > better than what you are describing.
>
> Hmmm... I wonder if we're having different ideas/thresholds in our
> descriptions of "sounds ok" or "sounds horrible"... Perhaps I should
> have posted samples of the sounds? :-)
Seems like you have access to a web site where you could.
> As I said in my reply to Rune, when trying the "naive" VQ approach
> directly taking the entire residual as a 128-dimensional vector (well,
> 64-dimensional, as I tried reducing the frame size to 64), even though
> the "transcoded" residual looks extremely different, the quality of
> the reconstructed audio is astonishing. And I mean *really* better,
> as in comparing a badly-tuned AM radio to a regular-quality FM.
Well that's good. I'm 100% with Rune in that I think it very important to get the simplest scheme going first, then add in layers of complexity.
> I'm quite puzzled, because *to the eye*, the approximation that I'm
> getting from the parametric representation looks *infinitely better*
> than what I'm getting from the standard VQ; however, the actual
> sound that I get is quite the opposite: the VQ version sounds
> infinitely better than mine...
Well it's pretty amazing that a single impulse per pitch period or white noise can generate perfectly intelligible if not 100% natural speech (given clean input). In this case the synthetic excitation looks nothing like the real thing, but the spectral properties are preserved and that seems to be what is important. Indeed, all of speech recognition is built on this: only the spectral envelope is used and all of the excitation signal is thrown away.
> > Also - have you read the literature on CELP coders?
>
> My advisor suggested that when I talked to him about this. From what
> I could understand, it sounded a lot like an analysis-by-synthesis
> Vector Quantizer, but with the codebook artificially generated?
I'd definitely look it up - most books on speech/audio coding will cover CELP.
> > I like your project - I hope you enjoy it.
>
> Well, I'm enjoying it a lot! Unfortunately, I can not disclose the
> other part (the presumably important part), at least until I finish
> writing the thesis and the results and the idea is published (I
> mean, I don't know if I can, but I prefer not to take the risk of
> getting in trouble with McGill ;-)).
Sounds good. If you publish online then please send me a link. Tony
Reply by Carlos Moreno September 1, 2005
tony.nospam@nospam.tonyRobinson.com wrote:
> [ crossposted to comp.speech.research as you'll find a lot of
> knowledgeable people there]
Thanks! That's a good idea!
> [...]
>
> If your final goal is VQ on the residual you are adding in a lot of
> other processing steps. Do you know how you are going to allocate the
> number of bits for each step? Do you have an idea as to what you'd like
> the final bit rate to be?
No, not really. The important thing from the point of view of the research project I'm working on is that I need to apply VQ with a given bitrate and then try some modifications (which is the core of my project -- evaluating the effectiveness of this "variation" of the VQ technique) and see how the quality of the output compares at the same bitrate. The encoding I was trying is not the core of my project -- it was supposed to be an auxiliary trick I was hoping would make the VQ technique simpler and more efficient... (it doesn't seem to be the case :-( )
> What happens if you just use the four largest values as the residual?
> For voiced speech I'd expect reasonable output.
Interesting variation. In fact, when I feed the reconstruction filter with the peaks added to the best-fitting polynomial (which is in essence sort of an "ultra low-pass-filtered" version, an ultra-smoothed version), then I get "reasonable" output (i.e., perfectly intelligible, but with an artificial quality, quite uneasy to the ear)
> Are there two 10th order polynomials involved here? If so, what is the
> first doing? Have you tried "just finding the variance" - if not, then
> it's always best to do the simple things first.
Yes, I had originally tried with just the variance; when seeing that it didn't work nicely at all, I then observed that the apparent amplitude of the noise was not constant. In particular, near the peaks, there seems to consistently be more activity.

That's why I figured it might be worth trying to model the "amplitude of the noise" as a function of the position in the frame. This basically leads to a best-fitting polynomial for the square deviation -- that gives me the best approximation of the "variance as a function of the position", and then I use that when generating the pseudo-random noise.
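A minimal sketch of that two-polynomial scheme as described (the normal-equation least-squares solver, the scaling of position to t in [0,1], and all names are illustrative choices, not the poster's actual code):

    #include <algorithm>
    #include <cmath>
    #include <random>
    #include <vector>

    // Least-squares fit of y[n] ~ sum_k c[k] * t^k, with t = n/(N-1) in
    // [0,1] to keep the normal equations reasonably conditioned. Solved
    // by plain Gaussian elimination (no pivoting -- fine for a sketch).
    std::vector<double> polyFit(const std::vector<double>& y, int deg)
    {
        const int N = (int)y.size(), M = deg + 1;
        std::vector<std::vector<double>> A(M, std::vector<double>(M + 1, 0.0));
        for (int n = 0; n < N; ++n) {
            double t = (N > 1) ? double(n) / (N - 1) : 0.0;
            std::vector<double> p(M);
            p[0] = 1.0;
            for (int k = 1; k < M; ++k) p[k] = p[k - 1] * t;
            for (int i = 0; i < M; ++i) {
                for (int j = 0; j < M; ++j) A[i][j] += p[i] * p[j];
                A[i][M] += p[i] * y[n];   // right-hand side
            }
        }
        for (int i = 0; i < M; ++i)
            for (int r = i + 1; r < M; ++r) {
                double f = A[r][i] / A[i][i];
                for (int c = i; c <= M; ++c) A[r][c] -= f * A[i][c];
            }
        std::vector<double> coef(M);
        for (int i = M - 1; i >= 0; --i) {
            double s = A[i][M];
            for (int j = i + 1; j < M; ++j) s -= A[i][j] * coef[j];
            coef[i] = s / A[i][i];
        }
        return coef;
    }

    double polyEval(const std::vector<double>& c, double t)
    {
        double v = 0.0;
        for (int k = (int)c.size() - 1; k >= 0; --k) v = v * t + c[k];
        return v;
    }

    // Reconstruction: mean polynomial plus Gaussian noise whose variance
    // comes from the second polynomial (clamped at zero, since a raw
    // least-squares fit of squared values can dip negative).
    std::vector<double> reconstructNoise(const std::vector<double>& meanCoef,
                                         const std::vector<double>& varCoef,
                                         int N)
    {
        static std::mt19937 rng(12345);
        std::normal_distribution<double> gauss(0.0, 1.0);
        std::vector<double> out(N);
        for (int n = 0; n < N; ++n) {
            double t = (N > 1) ? double(n) / (N - 1) : 0.0;
            double var = std::max(0.0, polyEval(varCoef, t));
            out[n] = polyEval(meanCoef, t) + std::sqrt(var) * gauss(rng);
        }
        return out;
    }

Usage in this scheme would presumably be: fit meanCoef to the de-peaked residual, subtract it, fit varCoef to the squared leftovers, and call reconstructNoise() at the decoder.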
> This sounds like you have two polynomials - I'd be surprised if the mean
> of the residuals is far from zero - why not post plots of the residual?
I did. In the link I posted, the green signal is the actual residual, and the red one is the reconstructed one: http://www.mochima.com/tmp/residual.png (I hope I'm understanding correctly your question)
> Good. However I still think you are trying to do too much at once. For
> clean speech you can get away with a pulse train for voiced excitation
> and uniform random noise for unvoiced and it doesn't sound too bad -
> better than what you are describing.
Hmmm... I wonder if we're having different ideas/thresholds in our descriptions of "sounds ok" or "sounds horrible"... Perhaps I should have posted samples of the sounds? :-)

As I said in my reply to Rune, when trying the "naive" VQ approach directly taking the entire residual as a 128-dimensional vector (well, 64-dimensional, as I tried reducing the frame size to 64), even though the "transcoded" residual looks extremely different, the quality of the reconstructed audio is astonishing. And I mean *really* better, as in comparing a badly-tuned AM radio to a regular-quality FM.

I'm quite puzzled, because *to the eye*, the approximation that I'm getting from the parametric representation looks *infinitely better* than what I'm getting from the standard VQ; however, the actual sound that I get is quite the opposite: the VQ version sounds infinitely better than mine...
> Also - have you read the literature on CELP coders?
My advisor suggested that when I talked to him about this. From what I could understand, it sounded a lot like an analysis-by-synthesis Vector Quantizer, but with the codebook artificially generated?
> I like your project - I hope you enjoy it.
Well, I'm enjoying it a lot! Unfortunately, I can not disclose the other part (the presumably important part), at least until I finish writing the thesis and the results and the idea is published (I mean, I don't know if I can, but I prefer not to take the risk of getting in trouble with McGill ;-)). Thanks for your comments! Your message is highly appreciated! Carlos
Reply by Carlos Moreno September 1, 2005
Rune Allnor wrote:

>>Any comments? I know it was a quite longish post, but I'm hoping
>>some kind soul will have the patience to go through it and share
>>some thoughts.
>
> I wrote a somewhat elaborate reply to this post, but google
> complained about my PC when I tried to post it.
:-(((( (the sad face is not because I missed the detailed reply, but because I feel bad that you took the time and the effort to reply, and stupid Google (or stupid Windows, or stupid browser) just threw it away :-( )
> You introduce lots of elaborate steps, compared to the "naive"
> LPC encoder. Why not start with just the white noise excitation
> signal for the reconstruction filter, and add your additional
> steps one by one? That way you might be able to isolate one
> or two operations that produce bad effects in the reconstructed
> data.
I've tried that as part of my debugging steps.

With just white noise, it sounds intelligible, but quite scratchy (like someone with a *really* bad cold, or rather, a bad case of laryngitis or something like that).

With just the pulses and the "best fitting polynomial" for the mean (i.e., the best-fitting polynomial for the frame), it sounds more or less ok, but it gets an "artificial" quality; like a "mechanical" synthesized voice (no, not completely robot-like from the 50s and 60s movies -- that happens when I feed it with a train of pulses at constant frequency :-)).

When I put the whole thing together, well, it's still *very* intelligible, but it just sounds extremely different and with artifacts.

I've kept working on it and discovered that, while there were no bugs in the reconstruction filtering part, there was a bug that made me get the wrong value for the peaks under some circumstances. But it was minor -- it was once every several thousand frames that I would get the wrong value for the peaks. I fixed it, but it didn't really change much.

Now, something really surprises me. I temporarily put aside the approach of parameterizing the residual before VQ'ing it, and tried the "naive" approach of VQ. I reduced the frame size to 64 samples, just used the square error as a distance criterion, and used straight average (i.e., sample-by-sample average) to compute the mean values (the centroids of the clusters) for the VQ codebook.

The results were astonishing (to me) in two senses: 1) the quality of the output (the transcoded speech) was really good. And 2) I was astonished to see how different the VQ'ed residuals were from the actual residuals. The only detail *more or less* consistent was the peaks and the values around the peaks. For the rest, it was horribly different.

That makes me wonder if perhaps the fact that I'm missing the values and the high correlation between the peaks and the few samples following the peak is what's making the huge difference with my approach?

Anyway, I guess I'll keep working on the "standard" VQ approach, since I have several other things to test at this point (this encoding thing was supposed to make it simpler for me to do the VQ, but it doesn't seem to be the case so far :-( ).

Thanks! Carlos
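A minimal sketch of that "naive" VQ step - squared-error nearest-neighbour search plus sample-by-sample centroid update. One assignment/update pass is shown; actual LBG/k-means training iterates this to convergence (and typically splits codewords to grow the codebook). All names are illustrative:

    #include <limits>
    #include <vector>

    using Vec = std::vector<double>;

    // Squared-error nearest-neighbour search over the codebook.
    int nearest(const std::vector<Vec>& codebook, const Vec& v)
    {
        int best = 0;
        double bestDist = std::numeric_limits<double>::max();
        for (int c = 0; c < (int)codebook.size(); ++c) {
            double d = 0.0;
            for (int i = 0; i < (int)v.size(); ++i) {
                double diff = v[i] - codebook[c][i];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }

    // One pass: assign each training vector to its nearest codeword, then
    // move each codeword to the straight (sample-by-sample) mean of its
    // cluster, exactly as described in the post above.
    void updateCodebook(std::vector<Vec>& codebook,
                        const std::vector<Vec>& train)
    {
        const int dim = (int)train[0].size();
        std::vector<Vec> sum(codebook.size(), Vec(dim, 0.0));
        std::vector<int> count(codebook.size(), 0);
        for (const Vec& v : train) {
            int c = nearest(codebook, v);
            for (int i = 0; i < dim; ++i) sum[c][i] += v[i];
            ++count[c];
        }
        for (int c = 0; c < (int)codebook.size(); ++c)
            if (count[c] > 0)
                for (int i = 0; i < dim; ++i)
                    codebook[c][i] = sum[c][i] / count[c];
    }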
Reply by September 1, 2005
[ crossposted to comp.speech.research as you'll find a lot of
knowledgeable people there]

Carlos Moreno <moreno_at_mochima_dot_com@xx.xxx> writes:

> As part of my research project, I'm working on vector-quantization
> applied to an LPC-like encoding scheme.
>
> For each frame of speech (I tentatively put a frame length of 128
> samples -- at 8kHz sampling rate), I compute the optimal prediction
> filter (using the autocorrelation method), and then apply that filter
> to the signal (well, I apply 1-P to the signal, where P(z) is the
> optimal prediction filter).
...
> So, I end up with the residual, and I want to encode this. The
> final goal is to do VQ on that residual.
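A minimal sketch of the analysis step just quoted - autocorrelation, Levinson-Durbin recursion for P(z), then inverse filtering by 1 - P(z); function and variable names are illustrative, not the poster's code:

    #include <vector>

    std::vector<double> lpcResidual(const std::vector<double>& frame, int order)
    {
        const int N = (int)frame.size();

        // Autocorrelation r[0..order]
        std::vector<double> r(order + 1, 0.0);
        for (int k = 0; k <= order; ++k)
            for (int n = k; n < N; ++n)
                r[k] += frame[n] * frame[n - k];

        // Levinson-Durbin: predictor coefficients a[1..order]
        std::vector<double> a(order + 1, 0.0);
        double err = r[0];
        for (int i = 1; i <= order; ++i) {
            double acc = r[i];
            for (int j = 1; j < i; ++j) acc -= a[j] * r[i - j];
            double k = acc / err;           // no guard against err == 0
            std::vector<double> prev(a);
            a[i] = k;
            for (int j = 1; j < i; ++j) a[j] = prev[j] - k * prev[i - j];
            err *= 1.0 - k * k;
        }

        // Residual e[n] = x[n] - sum_j a[j] * x[n-j], i.e. filter by 1 - P(z)
        std::vector<double> e(N);
        for (int n = 0; n < N; ++n) {
            double pred = 0.0;
            for (int j = 1; j <= order && j <= n; ++j)
                pred += a[j] * frame[n - j];
            e[n] = frame[n] - pred;
        }
        return e;
    }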
If your final goal is VQ on the residual you are adding in a lot of other processing steps. Do you know how you are going to allocate the number of bits for each step? Do you have an idea as to what you'd like the final bit rate to be?
> I notice that the residual consists of a few peaks plus noise,
> so I figured: I'll encode the amplitude and positions of up to
> 4 peaks (most frames have 1 peak, many have 2 or 3, and many of
> them have no peaks at all), and then, for the rest, I'll determine
> the best-fitting 10th-order polynomial to represent the data.
What happens if you just use the four largest values as the residual? For voiced speech I'd expect reasonable output. What happens if you use more values? In the limit of 128 it should work perfectly, as you'll have the unquantised residual.
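A minimal sketch of that baseline - keep the n largest-magnitude samples and zero the rest (illustrative names; residual is assumed to be one analysis frame and n at most the frame length):

    #include <algorithm>
    #include <cmath>
    #include <numeric>
    #include <vector>

    std::vector<double> keepLargest(const std::vector<double>& residual, int n)
    {
        std::vector<int> idx(residual.size());
        std::iota(idx.begin(), idx.end(), 0);
        // Order indices by descending magnitude; only the first n matter.
        std::partial_sort(idx.begin(), idx.begin() + n, idx.end(),
            [&](int a, int b) {
                return std::abs(residual[a]) > std::abs(residual[b]);
            });
        std::vector<double> out(residual.size(), 0.0);
        for (int i = 0; i < n; ++i)
            out[idx[i]] = residual[idx[i]];
        return out;   // feed this to the reconstruction filter
    }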
> Then, the rest should be almost pure "whitish" noise, only that
> the amplitude of that noise varies over the frame. So, what I
> do is that instead of just finding the variance (a constant,
> average value for the entire frame), I compute the best-fitting
> 10th-order polynomial for the square values of the remaining
> (0-mean) signal.
Are there two 10th order polynomials involved here? If so, what is the first doing? Have you tried "just finding the variance" - if not, then it's always best to do the simple things first.
> To reconstruct, I just add the mean (the value of the best-fitting
> polynomial for the data) with a pseudo-random Gaussian number,
> with variance given by the polynomial that gives me the "amplitude"
> of the noise as a function of the position in the frame.
This sounds like you have two polynomials - I'd be surprised if the mean of the residuals is far from zero - why not post plots of the residual?
> No, it is not a bug in the implementation (I'm actually quite
> positively certain of that -- really). If I keep the same frame
> and pass it through the reconstruction filter, I obtain the exact
> same audio stream (sample-by-sample the exact same values).
Good. However I still think you are trying to do too much at once. For clean speech you can get away with a pulse train for voiced excitation and uniform random noise for unvoiced and it doesn't sound too bad - better than what you are describing. I'd pick the n biggest samples, encode those exactly and model the rest with uniform random noise. Once that sounds okay you'll have a baseline to compare to for your ideas with 10th order polynomials.

Or think about it another way: if I were an external examiner on your research project I'd be asking what baseline systems you developed from, and expect a good answer.

Also - have you read the literature on CELP coders?

I like your project - I hope you enjoy it. Tony
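As an illustration of that pulse-train-plus-noise baseline, a minimal sketch (the voiced/unvoiced decision, pitch period and gain are assumed to come from elsewhere; all names are made up):

    #include <random>
    #include <vector>

    // Baseline excitation: impulse train at the pitch period for voiced
    // frames, uniform random noise for unvoiced ones. pitchPeriod must
    // be positive; the result is fed to the LP synthesis filter.
    std::vector<double> excitation(int N, bool voiced, int pitchPeriod,
                                   double gain)
    {
        std::vector<double> e(N, 0.0);
        if (voiced) {
            for (int n = 0; n < N; n += pitchPeriod)
                e[n] = gain;
        } else {
            static std::mt19937 rng(1);
            std::uniform_real_distribution<double> u(-gain, gain);
            for (int n = 0; n < N; ++n)
                e[n] = u(rng);
        }
        return e;
    }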
Reply by Rune Allnor September 1, 2005
Carlos Moreno wrote:

> Any comments? I know it was a quite longish post, but I'm hoping
> some kind soul will have the patience to go through it and share
> some thoughts.
I wrote a somewhat elaborate reply to this post, but google complained about my PC when I tried to post it. If the original post does not appear, here is the short version:

You introduce lots of elaborate steps, compared to the "naive" LPC encoder. Why not start with just the white noise excitation signal for the reconstruction filter, and add your additional steps one by one? That way you might be able to isolate one or two operations that produce bad effects in the reconstructed data.

Rune