Reply by George Johnson May 21, 2006
"Jack" <NOSPAM@THANK.YOU> wrote in message
news:446c61ed$0$15794$14726298@news.sunsite.dk...
| Hello,
|
| I have read some literature about LPC analysis as a tool
| for estimating the parameters of a source-filter speech model.
|
| According to the model, voiced speech can be modelled
| as S[z] = U[z] G[z] V[z] R[z], where G is a glottal pulse filter,
| V is the vocal-tract filter and R is a radiation filter. G has 2
| poles. V has 10 poles. R has 1 zero that cancels one
| of the poles of G. Pre-emphasizing voiced speech with
| a filter P=R cancels the remaining pole of G such that voiced,
| pre-emphasized speech is the response of the vocal-tract
| filter alone. The vocal tract filter is said to be driven by
| an excitation sequence which is a series of impulses with
| pitch frequency 1/T. In the
| frequency domain, the excitation signal can be regarded as
| a sampled version of a truly white signal.
|
| If I analyze 30 ms pre-emphasized, voiced speech segments
| I can estimate the parameters of V quite well, depending
| on T. The larger T is, the better.
|
| However, unvoiced speech is modelled as S[z] = U[z] V[z] R[z],
| where the excitation u[k] is random noise. This model is a zero/pole model.
| I can justify using LPC analysis by saying that this model, which
| has P poles, can be approximated by an all-pole model with Q > P
| poles. But then the model is no longer a model of the vocal tract
| and the LPC estimates are no longer estimates of the coefficients
| of V.
|
| The problem becomes even worse when LPC analysis is used
| to estimate V based on noisy, pre-emphasized speech segments.
| Then there is no justification for using LPC analysis to estimate
| V.
|
| Or am I mistaken?

    BZZZT!!!

    Wrong.  You'd be better off filtering vocal frequencies into their
individual frequency ranges (like the human ear does) and compressing those
individual waveforms (as you can filter out noise completely this way) as a
group.


Reply by George Johnson May 21, 2006
"Matt Mahoney" <matmahoney@yahoo.com> wrote in message
news:1147966050.144318.315640@j73g2000cwa.googlegroups.com...
| Jack wrote:
| > Hello,
| >
| > I have read some literature about LPC analysis as a tool
| > for estimating the parameters of a source-filter speech model.
| >
| > According to the model, voiced speech can be modelled
| > as S[z] = U[z] G[z] V[z] R[z], where G is a glottal pulse filter,
| > V is the vocal-tract filter and R is a radiation filter. G has 2
| > poles. V has 10 poles. R has 1 zero that cancels one
| > of the poles of G. Pre-emphasizing voiced speech with
| > a filter P=R cancels the remaining pole of G such that voiced,
| > pre-emphasized speech is the response of the vocal-tract
| > filter alone. The vocal tract filter is said to be driven by
| > an excitation sequence which is a series of impulses with
| > pitch frequency 1/T. In the
| > frequency domain, the excitation signal can be regarded as
| > a sampled version of a truly white signal.
| >
| > If I analyze 30 ms pre-emphasized, voiced speech segments
| > I can estimate the parameters of V quite well, depending
| > on T. The larger T is, the better.
| >
| > However, unvoiced speech is modelled as S[z] = U[z] V[z] R[z],
| > where the excitation u[k] is random noise. This model is a zero/pole model.
| > I can justify using LPC analysis by saying that this model, which
| > has P poles, can be approximated by an all-pole model with Q > P
| > poles. But then the model is no longer a model of the vocal tract
| > and the LPC estimates are no longer estimates of the coefficients
| > of V.
| >
| > The problem becomes even worse when LPC analysis is used
| > to estimate V based on noisy, pre-emphasized speech segments.
| > Then there is no justification for using LPC analysis to estimate
| > V.
| >
| > Or am I mistaken?
|
| I am not an expert on speech compression, but I think a better approach
| (assuming you want lossy compression) would be to remove as much noise
| as you can before LPC and transmit the noise spectrum separately.
| Noise is not compressible.  At the receiving end you generate random
| noise and filter it according to the spectral information you sent.
|
| -- Matt Mahoney

    OR a person could convert the speech into an image of the sound wave
(scaled to match existing waveforms of the key syllables) and run a quick
image-compare routine, with some stretching of the potential syllable
against the comparative base syllable.  The big problem with the simple
image compare is with overlapping human speech (2 people talking at
once).  With a basic loose "hit" percentage of match to the waveform (it is
a cheap noise filter), the routine is very fast without much computation.

    The bonus is that image compares are damn quick now and computer
memory is dirt cheap.
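
    For what it is worth, the "image compare with some stretching" idea is
close to classic template matching with dynamic time warping.  A minimal
sketch in Python, assuming the comparison runs on short-time log-energy
envelopes rather than literal images (that feature choice is an assumption,
not part of the suggestion above):

import numpy as np

def log_energy_frames(x, frame_len=240, hop=120):
    """Short-time log energy of a waveform (an assumed feature, not literal images)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    frames = [x[i * hop:i * hop + frame_len] for i in range(n)]
    return np.array([np.log(np.sum(f ** 2) + 1e-10) for f in frames])

def dtw_distance(a, b):
    """Dynamic time warping distance; the warping path is the 'stretching'."""
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[len(a), len(b)]

# Usage sketch (hypothetical templates): pick the stored base syllable whose
# warped distance to the incoming segment is smallest.
#   best = min(templates, key=lambda name: dtw_distance(feat, templates[name]))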

    Now if you want something with a cleaner hit rate, faster response, more
accurate function, and truer-to-human sound recognition... use the same
technique that a human ear uses.

    Filter the waveform into distinct frequency ranges over time.  Then
match loose image compares by hits along existing vocal frequency ranges to
ignore noise factors.  The human ear "hears" a large range of specific audio
frequencies all at once, unlike the standard microphone timeslice model.  Each
microhair in the inner ear "hears" only one distinct frequency range, like an
audio filter.  If one of those microhairs dies, then a human loses that
specific frequency range until the microhair is regrown.  In this manner,
musical audio would have zip in the way of speech-interfering noise because
it rarely hits the specific vocal frequencies of speech.  A word spoken with
background music would "hit" a higher percentage of the comparative
images of normal human vowels.

    A true-frequency compare would filter out background noise, but since
computers these days are highly image-focused there is no logical reason to
ignore that available power.  A basic image compare along distinct
vocal frequencies would logically be dramatically faster than raw waveform
comparison, with a much smaller compare range of speech-frequency images.
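
    A rough sketch of the band-splitting step, assuming an ordinary bank of
Butterworth bandpass filters as a crude stand-in for the ear's frequency
analysis (the band edges below are illustrative, not taken from the post):

import numpy as np
from scipy.signal import butter, sosfiltfilt

def vocal_band_filterbank(x, fs, edges=(300, 600, 1200, 2400, 3400)):
    """Split a signal into contiguous bands covering typical voice frequencies.
    Returns one band-limited signal per adjacent pair of band edges."""
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        bands.append(sosfiltfilt(sos, x))
    return np.stack(bands)

# Each row can then be compared (or compressed) on its own, and bands that are
# dominated by background music or noise can simply be ignored or down-weighted.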

    Of course, you say to yourself, well, it would still have problems with
two humans chatting in a room.  The truth is that this is where any
logical application would REQUIRE more than one microphone.  With multiple
microphones in play, the distinct vocal frequencies of 2 people would be
filterable by audio strength, and that would require another application
which is prioritized with tracking, in 3D space, where these multiple humans
physically are in the room as they move about.  Once that is done, the audio
streams can be isolated and compared individually for each human sound
emitter.

    Preferably a decent application would have up to 5 active microphones in
play, recording multiple audio streams filtered into a virtual 3D space so
that humans can be tracked and their audio waveforms compared.  The best
routine for simplicity of application and function would be to go with
"position aware" microphones, each using an internal Wi-Fi locator to
determine relative distances and specific positions (as one cannot assume
that the microphones would be stationary and unmoved at all times).
However, since ideal situations for quicker computation rarely come into
play, the cheapest option is to assume that a basic cheapo single stereo
microphone is in play and encourage the customer to upgrade to a superior
audio recording system, with options for that system available once it is
installed.  Considering that a basic 3D positional recognition routine comes
with the better audio software, adapting it to superior vocal recognition
functions would be relatively simple if one hires the right programmers.
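
    The microphone-pair part of this can at least be sketched: the relative
delay between two microphones is the lag that maximizes their
cross-correlation, and each such delay constrains where the talker can be.
A minimal sketch using plain cross-correlation (real systems usually use
something more robust, such as GCC-PHAT):

import numpy as np

def tdoa_seconds(mic_a, mic_b, fs, max_delay_s=0.01):
    """Estimate the time difference of arrival between two microphone signals
    as the cross-correlation lag with the largest value."""
    max_lag = int(max_delay_s * fs)
    corr = np.correlate(mic_a, mic_b, mode="full")
    lags = np.arange(-len(mic_b) + 1, len(mic_a))
    keep = np.abs(lags) <= max_lag
    best = lags[keep][np.argmax(corr[keep])]
    return best / fs

# Each microphone pair gives one delay; every delay pins the talker to a
# hyperbolic surface, and intersecting a few of them gives a rough 3D position.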

    As you can guess, even a simple plastics factory worker (it pays the
bills for my other hobbies) with an IQ of 150 can see the painfully obvious
when it is presented.


Reply by Matt Mahoney May 18, 2006
Jack wrote:
> Hello,
>
> I have read some literature about LPC analysis as a tool
> for estimating the parameters of a source-filter speech model.
>
> According to the model, voiced speech can be modelled
> as S[z] = U[z] G[z] V[z] R[z], where G is a glottal pulse filter,
> V is the vocal-tract filter and R is a radiation filter. G has 2
> poles. V has 10 poles. R has 1 zero that cancels one
> of the poles of G. Pre-emphasizing voiced speech with
> a filter P=R cancels the remaining pole of G such that voiced,
> pre-emphasized speech is the response of the vocal-tract
> filter alone. The vocal tract filter is said to be driven by
> an excitation sequence which is a series of impulses with
> pitch frequency 1/T. In the
> frequency domain, the excitation signal can be regarded as
> a sampled version of a truly white signal.
>
> If I analyze 30 ms pre-emphasized, voiced speech segments
> I can estimate the parameters of V quite well, depending
> on T. The larger T is, the better.
>
> However, unvoiced speech is modelled as S[z] = U[z] V[z] R[z],
> where the excitation u[k] is random noise. This model is a zero/pole model.
> I can justify using LPC analysis by saying that this model, which
> has P poles, can be approximated by an all-pole model with Q > P
> poles. But then the model is no longer a model of the vocal tract
> and the LPC estimates are no longer estimates of the coefficients
> of V.
>
> The problem becomes even worse when LPC analysis is used
> to estimate V based on noisy, pre-emphasized speech segments.
> Then there is no justification for using LPC analysis to estimate
> V.
>
> Or am I mistaken?
I am not an expert on speech compression, but I think a better approach
(assuming you want lossy compression) would be to remove as much noise
as you can before LPC and transmit the noise spectrum separately.
Noise is not compressible.  At the receiving end you generate random
noise and filter it according to the spectral information you sent.

-- Matt Mahoney
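
One way to read this (my sketch, not necessarily what is meant above):
estimate a coarse spectral envelope of the noise, transmit only that
envelope, and at the decoder shape locally generated white noise with it.
Assuming scipy is available:

import numpy as np
from scipy.signal import welch, firwin2

def noise_envelope(noise, fs, nbands=16):
    """Encoder side: a coarse noise spectrum to transmit (band-averaged Welch PSD)."""
    f, psd = welch(noise, fs=fs, nperseg=256)
    edges = np.linspace(0.0, fs / 2.0, nbands + 1)
    levels = [np.mean(psd[(f >= lo) & (f < hi)]) for lo, hi in zip(edges[:-1], edges[1:])]
    return edges, np.sqrt(np.maximum(levels, 1e-12))   # amplitude per band

def regenerate_noise(edges, levels, n, fs, seed=0):
    """Decoder side: shape locally generated white noise with the sent envelope."""
    centers = (edges[:-1] + edges[1:]) / 2.0
    freqs = np.concatenate(([0.0], centers, [fs / 2.0]))
    gains = np.concatenate(([levels[0]], levels, [levels[-1]]))
    h = firwin2(257, freqs, gains, fs=fs)
    white = np.random.default_rng(seed).standard_normal(n)
    return np.convolve(white, h, mode="same")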
Reply by Jack May 18, 2006
> I am not familiar with the details of your explanation.
> Could you point me to some good resources.
Sure, no problem.  Two books I am reading at the moment:
"Spoken Language Processing" by Xuedong Huang, Alex Acero and Hsiao-Wuen Hon,
and "Discrete-Time Processing of Speech Signals" by John R. Deller Jr.,
John H. L. Hansen and John G. Proakis.
Reply by banton May 18, 2006
Jack

I would be interested in the literature you are referring to.
I have the feeling that there are some inconsistencies about
the exact implementation.  More simple-minded explanations
usually just stick to an implementation that uses the reflection
coefficients with an all-pole lattice filter for the synthesis.
I am not familiar with the details of your explanation.
Could you point me to some good resources.

gr.
Anton
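
For reference, the implementation mentioned above (driving an all-pole
lattice filter with the reflection coefficients) fits in a few lines.  This
is only a sketch, and sign conventions for the reflection coefficients
differ between textbooks:

import numpy as np

def lattice_allpole_synth(k, excitation):
    """All-pole lattice synthesis: run the excitation through 1/A[z] given the
    reflection coefficients k[0..M-1] (one common sign convention)."""
    M = len(k)
    g = np.zeros(M + 1)              # backward values g_i[n-1] from the last sample
    out = np.zeros(len(excitation))
    for n, e in enumerate(excitation):
        f = e                        # forward value at the top stage, f_M[n]
        g_new = np.zeros(M + 1)
        for i in range(M, 0, -1):
            f = f - k[i - 1] * g[i - 1]          # f_{i-1}[n]
            g_new[i] = k[i - 1] * f + g[i - 1]   # g_i[n]
        g_new[0] = f                 # g_0[n] = f_0[n]
        out[n] = f                   # synthesized sample
        g = g_new
    return out

# Usage sketch (made-up coefficients, all |k| < 1 so the filter is stable):
#   out = lattice_allpole_synth([0.5, -0.3, 0.2], np.random.default_rng(0).standard_normal(1000))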


Reply by Jack May 18, 2006
> I am not sure, but why not?  You find the spectral envelope of
> the vocal tract.  The only difference is that your excitation is
> now a noise signal with flat spectrum instead of a pulse train
> with flat spectrum.
I'm not sure I agree.  During unvoiced speech, LPC analysis returns an
estimate of the coefficients of an all-pole filter, but according to the
speech model the filter that generates unvoiced speech is not an all-pole
filter; it is a zero/pole filter.  So the estimated envelope is not exactly
the envelope of the vocal-tract frequency response.  Maybe I should just
forget about the mismatch between the model that LPC analysis is based on
and the speech production model?
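
The all-pole approximation point is easy to check numerically: take a
pole/zero filter, compute a stretch of its impulse response, and fit
all-pole models of increasing order to it.  The fitted envelope approaches
the pole/zero response as the order grows, but the coefficients are no
longer those of V.  A sketch using the autocorrelation method (the example
filter is made up):

import numpy as np
from scipy.signal import lfilter, freqz
from scipy.linalg import solve_toeplitz

def lpc_autocorr(x, order):
    """All-pole fit by the autocorrelation method (normal equations)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    coef = solve_toeplitz((r[:-1], r[:-1]), r[1:])   # predictor coefficients a_i
    err = r[0] - np.dot(coef, r[1:])                 # prediction error power
    return np.concatenate(([1.0], -coef)), np.sqrt(err)

# A made-up zero/pole "unvoiced" filter: one zero and four poles.
b = [1.0, -0.95]                                     # the zero
a = [1.0, -1.2, 0.8, -0.3, 0.1]                      # the poles (stable)
h = lfilter(b, a, np.r_[1.0, np.zeros(511)])         # impulse response

w, H_true = freqz(b, a, worN=256)
for order in (4, 8, 16):
    A, G = lpc_autocorr(h, order)
    _, H_fit = freqz([G], A, worN=256)
    err_db = np.mean(np.abs(20.0 * np.log10(np.abs(H_true) / np.abs(H_fit))))
    print(order, round(err_db, 2))   # spectral mismatch (dB) shrinks as the order grows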
Reply by banton May 18, 2006

> The problem becomes even worse when LPC analysis is used
> to estimate V based on noisy, pre-emphasized speech segments.
> Then there is no justification for using LPC analysis to estimate
> V.
>
> Or am I mistaken?
I am not sure, but why not?  You find the spectral envelope of
the vocal tract.  The only difference is that your excitation is
now a noise signal with flat spectrum instead of a pulse train
with flat spectrum.
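
That argument can be tried on synthetic data: excite the same all-pole
filter once with a pulse train and once with white noise, run the same
autocorrelation LPC on both, and compare the estimates to the true
coefficients.  A sketch (the filter, frame length and pitch period are
made up):

import numpy as np
from scipy.signal import lfilter
from scipy.linalg import solve_toeplitz

def lpc_coeffs(x, order):
    """Autocorrelation-method LPC: returns [1, -a_1, ..., -a_p]."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    return np.concatenate(([1.0], -solve_toeplitz((r[:-1], r[:-1]), r[1:])))

rng = np.random.default_rng(0)
a_true = [1.0, -1.2, 0.8, -0.3, 0.1]       # made-up stable all-pole "vocal tract"
n = 480                                    # roughly 30 ms at 16 kHz

pulses = np.zeros(n)
pulses[::80] = 1.0                         # pulse-train excitation, period 80 samples
noise = rng.standard_normal(n)             # white-noise excitation

voiced = lfilter([1.0], a_true, pulses)
unvoiced = lfilter([1.0], a_true, noise)

# Both estimates should come out reasonably close to a_true.
print(np.round(lpc_coeffs(voiced, 4), 3))
print(np.round(lpc_coeffs(unvoiced, 4), 3))
print(a_true)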
Reply by Jack May 18, 2006
Hello,

I have read some literature about LPC analysis as a tool
for estimating the parameters of a source-filter speech model.

According to the model, voiced speech can be modelled
as S[z] = U[z] G[z] V[z] R[z], where G is a glottal pulse filter,
V is the vocal-tract filter and R is a radiation filter. G has 2
poles. V has 10 poles. R has 1 zero that cancels one
of the poles of G. Pre-emphasizing voiced speech with
a filter P=R cancels the remaining pole of G such that voiced,
pre-emphasized speech is the response of the vocal-tract
filter alone. The vocal tract filter is said to be driven by
an excitation sequence which is a series of impulses with
pitch frequency 1/T. In the
frequency domain, the excitation signal can be regarded as
a sampled version of a truly white signal.

If I analyze 30 ms pre-emphasized, voiced speech segments
I can estimate the parameters of V quite well, depending
on T. The larger T is, the better.

However, unvoiced speech is modelled as S[z] = U[z] V[z] R[z],
where the excitation u[k] is random noise. This model is a zero/pole model.
I can justify using LPC analysis by saying that this model, which
has P poles, can be approximated by an all-pole model with Q > P
poles. But then the model is no longer a model of the vocal tract
and the LPC estimates are no longer estimates of the coefficients
of V.

The problem becomes even worse when LPC analysis is used
to estimate V based on noisy, pre-emphasized speech segments.
Then there is no justification for using LPC analysis to estimate
V.

Or am I mistaken?
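
For reference, the voiced-case procedure described above (pre-emphasize,
take a roughly 30 ms frame, fit an all-pole model) looks roughly like this
in Python; the window, sample rate, pre-emphasis constant and model order
are assumptions, not taken from the post:

import numpy as np
from scipy.linalg import solve_toeplitz

def preemphasize(x, alpha=0.97):
    """Apply P[z] = 1 - alpha*z^-1 (a single zero, like the pre-emphasis above)."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def lpc_frame(frame, order=10):
    """Autocorrelation-method LPC on one windowed frame.
    Returns the denominator [1, -a_1, ..., -a_p] of the estimated 1/A[z]."""
    w = frame * np.hamming(len(frame))
    r = np.correlate(w, w, mode="full")[len(w) - 1:len(w) + order]
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])
    return np.concatenate(([1.0], -a))

# Usage sketch (hypothetical names):
#   fs = 16000
#   frame = speech[start:start + int(0.030 * fs)]   # a 30 ms voiced segment
#   A = lpc_frame(preemphasize(frame), order=10)
# The roots of A inside the unit circle give the estimated resonances of V.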