LPC and voice reconstruction

Started by Sergio June 8, 2006

Hello everybody.

I'm writing some programs that process human speech. As part of my
project, I want to write a module that extracts the LPC coefficients
from a piece of speech, and I'd like to find an algorithm to
reconstruct a waveform that sounds approximately like the original
sound that the LPC coefficients were taken from.

I've recorded my voice saying a vowel, something like "eeeee", at a
sampling frequency of 44100 Hz and 16 bits per sample. I save it in a
RIFF WAVE (.wav) file and then load it into a memory buffer in a
program. Since the sound is approximately a periodic signal that
repeats over time, I take a piece from the middle of the buffer and
determine the LPC coefficients using the algorithm explained in
Numerical Recipes in C, chapter "Linear Prediction and Linear
Predictive Coding". I use the 'memcof' function with 13 coefficients,
so from a piece of 882 samples between -32768 and 32767 I get 13 LP
coefficients.
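
For illustration, here is a minimal sketch of a Burg-type estimator, the kind of recursion 'memcof' is usually described as using. It is not the NR code: the name `burg_lpc`, the coefficient ordering (`d[k]` multiplies `x[n-1-k]`) and the cosine test frame are all assumptions made for this example.

```python
import math

def burg_lpc(x, order):
    """Estimate LP coefficients d so that x[n] ~ sum(d[k] * x[n-1-k])."""
    n = len(x)
    f = list(x[1:])    # forward prediction errors for samples 1..n-1
    b = list(x[:-1])   # backward prediction errors, delayed by one sample
    d = []
    power = sum(v * v for v in x) / n   # mean square of the input
    for m in range(order):
        num = sum(fi * bi for fi, bi in zip(f, b))
        den = sum(fi * fi + bi * bi for fi, bi in zip(f, b))
        k = 2.0 * num / den             # reflection coefficient, |k| <= 1
        power *= 1.0 - k * k            # residual mean square after this stage
        # Levinson-style update of the coefficient set to order m + 1
        d = [d[i] - k * d[m - 1 - i] for i in range(m)] + [k]
        # Update both error sequences, then drop the sample that falls off
        f, b = ([fi - k * bi for fi, bi in zip(f, b)][1:],
                [bi - k * fi for fi, bi in zip(f, b)][:-1])
    return d, power

# Hypothetical test frame: 882 samples of a pure cosine, order 2
frame = [math.cos(0.3 * i) for i in range(882)]
coeffs, residual = burg_lpc(frame, 2)
```

For a pure cosine at 0.3 rad/sample the two coefficients should come out close to 2*cos(0.3) and -1. On real speech, an order of 13 and an 882-sample frame would be passed instead.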

Given that, I thought that by taking the first 13 samples of my
signal, applying the recursion with my 13 LP coefficients to them, and
continuing that process, I would get a signal that, for many cycles,
would resemble, and sound similar to, a voice saying "eeee". But it
doesn't work that way at all. Taking 13 true values of the signal (me
saying "eeee") as the initial values and applying the recursion with
the 13 LP coefficients, using an algorithm that does exactly what the
'predic' routine of Numerical Recipes does, I get a signal that
QUICKLY DECAYS TO ZERO. It looks like a damped sinusoid: it takes some
values above zero, crosses zero, takes some negative values, and after
a negative minimum rises again, quickly tending towards 0.
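
The decay described above can be reproduced with a toy example. This is only a sketch under stated assumptions: `extrapolate` is a hypothetical helper (not the NR 'predic' routine), and the two coefficients are hand-built for a single resonance with pole radius `r = 0.99`, not taken from real speech.

```python
import math

def extrapolate(d, seed, count):
    """Continue `seed` for `count` samples with x[n] = sum(d[k] * x[n-1-k])."""
    out = list(seed)
    for _ in range(count):
        # out[-1 - k] is the sample k + 1 steps in the past
        out.append(sum(dk * out[-1 - k] for k, dk in enumerate(d)))
    return out

# Assumed two-pole resonator: poles at r * exp(+/- j * theta)
r, theta = 0.99, 0.3
d = [2.0 * r * math.cos(theta), -r * r]
seed = [1.0, r * math.cos(theta)]      # first two samples of r**n * cos(theta*n)
signal = extrapolate(d, seed, 880)     # 882 samples in total
```

With no excitation the recursion can only follow the filter's natural response, here `r**n * cos(theta*n)`, so the envelope shrinks by a factor `r` per sample. Any stable filter, whose poles lie inside the unit circle, behaves this way.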

Of course, the idea in NR that you can reconstruct a speech signal by
"driving these coefficients with initial conditions consisting of
zeros except for one nonzero spike" has the same problem. I've tried
adding a value of 10000 to the signal at regular intervals, for
example one spike every 441 samples, to introduce an impulse
excitation at 100 Hz, or one every hundred-and-something samples, to
put the impulses at roughly 300 Hz. The result is a signal that takes
significant values in the samples right after each impulse and then,
once again, quickly decays to zero. Played back as a .wav (saved to
disk with the necessary headers, of course), it sounds like a pure
tone at the frequency of the spikes. I've also tried adding the value
of a sinusoid to every sample, on top of the value of the recursion,
but again I get a sound that resembles only the sinusoid I introduce.
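
The impulse-train experiment can be written down as a small all-pole synthesis loop. This is a sketch, not the program described above: `synthesize` and `impulse_train` are hypothetical names, and the two coefficients again describe an assumed single resonance rather than a measured vowel.

```python
import math

def synthesize(d, excitation):
    """All-pole synthesis: y[n] = e[n] + sum(d[k] * y[n-1-k])."""
    y = []
    for n, e in enumerate(excitation):
        acc = e
        for k, dk in enumerate(d):
            if n - 1 - k >= 0:          # samples before the start count as zero
                acc += dk * y[n - 1 - k]
        y.append(acc)
    return y

def impulse_train(length, period, amplitude):
    """Zeros except for one spike of `amplitude` every `period` samples."""
    return [amplitude if n % period == 0 else 0.0 for n in range(length)]

# Assumed resonator with pole radius 0.97; a spike every 441 samples
# corresponds to 100 Hz at a 44100 Hz sampling rate
r, theta = 0.97, 0.3
d = [2.0 * r * math.cos(theta), -r * r]
voiced = synthesize(d, impulse_train(4410, 441, 10000.0))
```

After the first few periods the output repeats exactly at the spike rate, and each period is one copy of the filter's impulse response. If that response dies out well within 441 samples, most of each period is close to silence, which matches the observation above.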

What have I done wrong? Are my LP coefficients possibly wrong? (I
don't think so.) Should I use another way to reconstruct the waveform?
How should I do it?

Another approach I've thought of, which I hope to try in the coming
days, would be to use the LP coefficients to calculate the power of
the signal at 32768 different frequencies between 0 and 1/2 of the
sampling frequency (that is, up to the Nyquist frequency). I would
then mirror those values into another 32768 successive values, so that
H(-f)=[H(f)]* holds, and apply an inverse DFT to the resulting 65536
values to get 65536 consecutive samples of a waveform that would have
the spectral characteristics determined by my LP coefficients.
Possibly I would first multiply those values, in the frequency domain,
by the discrete Fourier transform of an excitation signal: for voiced
sounds, such as me saying "eee", the FFT of a signal consisting of
impulses every 1/200 of a second, to give an excitation at 200 Hz;
for unvoiced sounds, the FFT of a time-domain signal containing white
noise.
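
As a sketch of the first step of that plan, the response of the all-pole filter can be evaluated directly from the coefficients. The function name `lpc_response`, the 513-point grid and the two-pole test filter are assumptions for illustration. Note that the power |H(f)|^2 carries no phase, so the complex H(f) is what would have to be multiplied by an excitation spectrum.

```python
import cmath
import math

def lpc_response(d, f):
    """H(f) = 1 / (1 - sum(d[k] * exp(-2j*pi*f*(k+1)))), f in cycles/sample."""
    denom = 1.0 - sum(dk * cmath.exp(-2j * math.pi * f * (k + 1))
                      for k, dk in enumerate(d))
    return 1.0 / denom

# Assumed two-pole resonator centred near 0.1 cycles/sample
r, theta = 0.98, 2.0 * math.pi * 0.1
d = [2.0 * r * math.cos(theta), -r * r]

# Evaluate the power on a grid from 0 to 1/2 (the Nyquist frequency)
freqs = [0.5 * i / 512 for i in range(513)]
power = [abs(lpc_response(d, f)) ** 2 for f in freqs]
peak = freqs[power.index(max(power))]
```

Because the coefficients are real, `lpc_response(d, -f)` is the complex conjugate of `lpc_response(d, f)`, which is the symmetry the mirroring step relies on.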

Is it a promising line, or is it a mad idea?

I'm happy to send a copy of my source code, Windows executables,
screenshots of my waveforms, etc. to anybody willing to help, of
course.

Thank you very much in advance,


Sergio


The behaviour of the LPC oscillator depends on the method used to compute the prediction coefficients. Some methods result in stable minimum-phase filters; others don't (but have better spectral estimation characteristics). The fast decay of the stable filters is due to the flattening of the spectrum of AR processes measured with additive noise (which turns an AR process into an ARMA process). The poles recede from the unit circle, causing the impulse response to decay faster.

Another serious issue is the initialization of the oscillator (LPC filter) states: some methods for computing the linear prediction coefficients result in filters that are highly sensitive to noisy initial values (resulting in nearly unstable extrapolation behaviour), whereas other methods are less sensitive.

There are several ways to combat the fast decay of the LPC oscillation filters:

1. Warped LPC.
2. Expanding the (relevant) poles back towards the unit circle.
3. Drastically increasing the prediction order (order of hundreds instead of tens).
4. Combining short-term and long-term predictors (where the long-term predictor has a prediction step on the order of the fundamental pitch period of the signal, and the short-term predictor is a single-step predictor).

All of the above methods still result in exponentially decaying extrapolation signals, however (unless you decide to use an LPC method that does not guarantee a stable prediction filter, such as the covariance method). All they do is decrease the exponential decay rate. Such extrapolators are usually only good for a couple of hundred samples; using forward/backward interpolation, maybe up to a thousand samples (depending on the signal).

If you want to generate several seconds of "eeee" sounds from your sample, you'll have to go for sinusoidal modeling.

Regards,
Andor
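
The second remedy listed in the reply (expanding poles back towards the unit circle) can be sketched in its simplest form. The assumption here, not necessarily what Andor had in mind, is that every coefficient `d[k]` (the weight of `x[n-1-k]`) is scaled by `gamma**(k+1)`. That multiplies all pole radii by `gamma`, not just the relevant ones. The name `scale_poles` and the numbers are hypothetical.

```python
import math

def scale_poles(d, gamma):
    """Return coefficients whose poles are those of `d` multiplied by gamma."""
    return [dk * gamma ** (k + 1) for k, dk in enumerate(d)]

# Assumed two-pole resonator with pole radius 0.90
r, theta, gamma = 0.90, 0.3, 1.05
d = [2.0 * r * math.cos(theta), -r * r]
d_slow = scale_poles(d, gamma)   # pole radius becomes gamma * r = 0.945
```

The scaled filter still decays exponentially, only more slowly, which agrees with the caveat in the reply. `gamma` has to be chosen so that the largest pole radius stays below 1, otherwise the extrapolation grows without bound.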