Does LPC analysis of noisy speech make any sense?

Started by ●May 18, 2006

Hello,

I have read some literature about LPC analysis as a tool for estimating the parameters of a source-filter speech model.

According to the model, voiced speech is modelled in the z-domain as S(z) = U(z) G(z) V(z) R(z), where G is a glottal pulse filter, V is the vocal-tract filter and R is a radiation filter. G has 2 poles, V has 10 poles, and R has 1 zero that cancels one of the poles of G. Pre-emphasizing voiced speech with a filter P = R cancels the remaining pole of G, so that voiced, pre-emphasized speech is the response of the vocal-tract filter alone. The vocal-tract filter is said to be driven by an excitation sequence u[k] that is a train of impulses with pitch period T (pitch frequency 1/T). In the frequency domain, this excitation can be regarded as a sampled version of a truly white signal.

If I analyze 30 ms segments of pre-emphasized, voiced speech, I can estimate the parameters of V quite well, depending on T: the larger T is, the better.

However, unvoiced speech is modelled as S(z) = U(z) V(z) R(z), where u[k] is random noise. This is a pole-zero model. I can justify using LPC analysis by arguing that this model, which has P poles, can be approximated by an all-pole model with Q > P poles. But then the model is no longer a model of the vocal tract, and the LPC estimates are no longer estimates of the coefficients of V.

The problem becomes even worse when LPC analysis is used to estimate V based on noisy, pre-emphasized speech segments. Then there is no justification for using LPC analysis to estimate V.

Or am I mistaken?
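For concreteness, here is a minimal sketch of the analysis pipeline described above, in Python/NumPy: pre-emphasis, one 30 ms Hamming-windowed frame, and the autocorrelation method of LPC. The 8 kHz sampling rate, the 0.95 pre-emphasis coefficient, and the random placeholder signal are illustrative assumptions, not values taken from the post.

import numpy as np
from scipy.linalg import solve_toeplitz

fs = 8000
x = np.random.default_rng(0).standard_normal(3 * fs)  # stand-in for a real speech signal

# Pre-emphasis P(z) = 1 - 0.95 z^-1 (0.95 is a typical choice, assumed here)
pre = np.append(x[0], x[1:] - 0.95 * x[:-1])

# One 30 ms analysis frame, Hamming-windowed
N = int(0.030 * fs)
frame = pre[:N] * np.hamming(N)

# Autocorrelation method: solve the Toeplitz normal equations for the predictor
p = 10                                   # order matches the 10-pole V(z) above
r = np.correlate(frame, frame, 'full')[N - 1:]
coefs = solve_toeplitz(r[:p], r[1:p + 1])
A = np.concatenate(([1.0], -coefs))      # 1/A(z) is the all-pole estimate of V(z)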
Reply by ●May 18, 2006
>The problem even becomes worse when LPC analysis is used
>to estimate V based on noisy, pre-emphasized speech segments.
>Then there is no justification for using LPC analysis to estimate
>V.
>
>Or am I mistaken?

I am not sure, but why not? You find the spectral envelope of the vocal tract. The only difference is that your excitation is now a noise signal with a flat spectrum instead of a pulse train with a flat spectrum.
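That claim is easy to check numerically. Below is a toy sketch (a made-up 2-pole filter stands in for the vocal tract) in which essentially the same all-pole coefficients are estimated whether the excitation is a pulse train or white noise; the pulse-train estimate is slightly biased by the pitch harmonics, consistent with the earlier remark about T.

import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

a_true = np.array([1.0, -1.3, 0.8])       # toy all-pole "vocal tract" (assumed)
n, T = 4000, 80                           # T = pitch period in samples (assumed)
pulses = np.zeros(n); pulses[::T] = 1.0   # impulse-train excitation
noise = np.random.default_rng(0).standard_normal(n)  # white-noise excitation

for name, u in (("pulse train", pulses), ("white noise", noise)):
    s = lfilter([1.0], a_true, u)                 # excite the same filter
    r = np.correlate(s, s, 'full')[n - 1:]        # autocorrelation of the output
    coefs = solve_toeplitz(r[:2], r[1:3])         # order-2 LPC
    print(name, np.concatenate(([1.0], -coefs)))  # both come out close to a_true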
Reply by ●May 18, 2006
> I am not sure, but why not? You find the spectral envelope of
> the vocal tract. The only difference is that your excitation is
> now a noise signal with flat spectrum instead of a pulse train
> with flat spectrum.

I'm not sure I agree. During unvoiced speech, LPC analysis returns an estimate of the coefficients of an all-pole filter, but according to the speech model, the filter that generates unvoiced speech is not an all-pole filter; it's a pole-zero filter. So the estimated envelope is not exactly the envelope of the vocal-tract frequency response. Maybe I should just forget about the mismatch between the model LPC analysis is based on and the speech production model?
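The approximation being discussed can also be checked numerically. The sketch below (with made-up filter coefficients) fits all-pole models of increasing order Q to the output of a pole-zero filter; the residual prediction error shrinks as Q grows, which is the usual justification for the all-pole approximation, even though the fitted poles stop being "the vocal tract".

import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

b = np.array([1.0, -0.9])                 # one zero (made-up coefficients)
a = np.array([1.0, -1.2, 0.7])            # two poles (made-up coefficients)
s = lfilter(b, a, np.random.default_rng(1).standard_normal(8000))

r = np.correlate(s, s, 'full')[len(s) - 1:]
for q in (2, 4, 8, 16):                   # all-pole orders Q > P
    coefs = solve_toeplitz(r[:q], r[1:q + 1])
    err = r[0] - np.dot(coefs, r[1:q + 1])  # minimum prediction error
    print(q, err)                           # error keeps shrinking as Q grows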
Reply by ●May 18, 2006
Jack,

I would be interested in the literature you are referring to. I have the feeling that there are some inconsistencies about the exact implementation. Simpler explanations usually just stick to an implementation that uses the reflection coefficients with an all-pole lattice filter for the synthesis. I am not familiar with the details of your explanation. Could you point me to some good resources?

gr. Anton
Reply by ●May 18, 2006
> I am not familiar with the details of your explanation.
> Could you point me to some good resources.

Sure, no problem. Two books I am reading at the moment:

"Spoken Language Processing" by Xuedong Huang, Alex Acero and Hsiao-Wuen Hon

"Discrete-Time Processing of Speech Signals" by John R. Deller Jr., John H. L. Hansen and John G. Proakis
Reply by ●May 18, 2006
Jack wrote:

> The problem even becomes worse when LPC analysis is used
> to estimate V based on noisy, pre-emphasized speech segments.
> Then there is no justification for using LPC analysis to estimate
> V.
>
> Or am I mistaken?

I am not an expert on speech compression, but I think a better approach (assuming you want lossy compression) would be to remove as much noise as you can before LPC and transmit the noise spectrum separately. Noise is not compressible. At the receiving end you generate random noise and filter it according to the spectral information you sent.

-- Matt Mahoney
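The receiving end Matt describes might look like the sketch below, assuming the "spectral information" is transmitted as an all-pole envelope A(z) plus a gain g; that encoding is an assumption, since the post does not specify a format.

import numpy as np
from scipy.signal import lfilter

A = np.array([1.0, -1.2, 0.7])   # received envelope coefficients (example values)
g = 0.1                          # received gain (example value)
excitation = np.random.default_rng(2).standard_normal(2400)  # locally generated noise
reconstructed = g * lfilter([1.0], A, excitation)  # noise shaped by the sent spectrum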
Reply by ●May 21, 2006
"Matt Mahoney" <matmahoney@yahoo.com> wrote in message news:1147966050.144318.315640@j73g2000cwa.googlegroups.com... | Jack wrote: | > Hello, | > | > I have read some literature about LPC analysis as a tool | > for estimating the parameters of a source-filter speech model. | > | > According to the model, voiced speech can be modelled | > as s[k]=u[k]G[z]V[z]R[z] where G is a glottal pulse filter, | > V is vocal-tract filter and R is a radiation filter. G has 2 | > poles. V has 10 poles. R has 1 zero that cancels one | > of the poles of G. Pre-emphasizing voiced speech with | > a filter P=R cancels the remaining pole of G such that voiced, | > pre-emphasized speech is the response of the vocal-tract | > filter alone. The vocal tract filter is said to be driven by | > an excitation sequence which is a series of impulses with | > pitch frequency 1/T. In the | > frequency domain, the excitation signal can be regarded as | > a sampled version of a truly white signal. | > | > If analyze 30ms pre-emphasized, voiced speech segments | > I can estimate the parameters of the V quite well depending | > on T. The larger T is the better. | > | > However, unvoiced speech is modelled as s[k]=u[k]V[z]R[z] | > where u[k] is random noise. This model is a zero/pole model. | > I can justify using LPC analysis by saying that this model which | > has P poles can be approximated by an all-pole model with Q>P | > poles. But then the model is no longer a model of the vocal tract | > and the LPC estimates are no longer estimates of the coefficients | > of V. | > | > The problem even becomes worse when LPC analysis is used | > to estimate V based on noisy, pre-emphasized speech segments. | > Then there is no justification for using LPC analysis to estimate | > V. | > | > Or am I mistaken? | | I am not an expert on speech compression, but I think a better approach | (assuming you want lossy compression) would be to remove as much noise | as you can before LPC and transmit the noise spectrum separately. | Noise is not compressible. At the receiving end you generate random | noise and filter it according to the spectral information you sent. | | -- Matt Mahoney OR a person could convert the speech into a image of the soundwave (scaled to match existing waveforms of the key syllables) and run a quick image compare routine with some stretching of the potential syllable with the comparative base syllable. The big problem with the simple image compare is logically within overlapping human speech (2 people talking at once). With a basic loose "Hit" compare percent of match to waveform (it is a cheap noise filter) then the routine is very fast without much computation The bonus is in that image compares are damn quicker now and computer memory is dirt cheap. Now if you want something with a cleaner hit rate, faster response, more accurate function, and truer-to-human sound recognition... use the same technique that a human ear uses. Filter the waveform into distinct frequency ranges over time. Then match loose image compares by hits along existing vocal frequency ranges to ignore noise factors. The human ear "hears" a large range of specific audio frequencies all at once over the standard microphone timeslice model. Each microhair in the inner ear "hears" only one distinct frequency range like an audio filter. If one of those microhairs dies, then a human loses that specific frequency range until the microhair is regrown. 
In this manner, musical audio would have zip in the way of speech-interferring noise because it rarely hits the specific vocal frequencies of speech. A word spoken with background audio music would "hit" a higher percentage of the comparitive images of normal human vowels. A true-frequency compare would filter out background noise, but since computers these days are highly image-focused there is no logical reason to ignore that available power for usage. A basic image compare along distinct vocal frequencies would logically be dramatically faster than waveform comparison with a much smaller compare range of speach-frequency images. Of course, you say to yourself, well, it would still have problems with two humans chatting in a room. Well, the truth is in that is where any logical application would REQUIRE more than one microphone. With multiple microphones in play, the distinct multiple vocal frequencies of 2 people would be filterable by strength of audio and would require another application which is prioritized with tracking in 3D space where these multiple human are physically in the room as they move about. Once that is done, the audio streams can be isolated and compared individually for each human sound emitter. Preferably a decent application would have up to 5 active microphones in play recording multiple audio streams filtered into a virtual 3D space so that humans can be tracked and their audio waveforms compared. The best routine for simplicity of application and function would be to go with "position aware" microphones each using an internal WI-FI locater to determine relative distances and specific positions (as one cannot assume that the microphones would be stationary and unmoved at all times). However, since ideal situations for quicker computation rarely come into play, the cheapest function is to assume that a basic cheapo single stereo microphone is in play and encourage the customer to upgrade to a superior audio recording system with options for that system available one it is installed. Considering that a basic 3D positional recognition routine comes with the better audio software, adapting it to superior vocal recognition functions would be relatively simple if one hires the right programmers. As you can guess, even a simple plastics factory worker (it pays the bills for my other hobbies) with an IQ of 150 can see the painfully obvious when it is presented.
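The band-splitting step this post keeps returning to could look like the sketch below: a small bank of Butterworth band-pass filters decomposing a signal into ear-like frequency channels. The band edges and filter order are arbitrary illustrative choices.

import numpy as np
from scipy.signal import butter, sosfilt

fs = 8000
x = np.random.default_rng(3).standard_normal(fs)    # stand-in for a recorded signal
edges = [(100, 300), (300, 700), (700, 1500), (1500, 3400)]  # Hz, assumed bands

bands = []
for lo, hi in edges:
    sos = butter(4, [lo, hi], btype='bandpass', fs=fs, output='sos')
    bands.append(sosfilt(sos, x))         # one output channel per frequency band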
Reply by ●May 21, 2006
"Jack" <NOSPAM@THANK.YOU> wrote in message news:446c61ed$0$15794$14726298@news.sunsite.dk... | Hello, | | I have read some literature about LPC analysis as a tool | for estimating the parameters of a source-filter speech model. | | According to the model, voiced speech can be modelled | as s[k]=u[k]G[z]V[z]R[z] where G is a glottal pulse filter, | V is vocal-tract filter and R is a radiation filter. G has 2 | poles. V has 10 poles. R has 1 zero that cancels one | of the poles of G. Pre-emphasizing voiced speech with | a filter P=R cancels the remaining pole of G such that voiced, | pre-emphasized speech is the response of the vocal-tract | filter alone. The vocal tract filter is said to be driven by | an excitation sequence which is a series of impulses with | pitch frequency 1/T. In the | frequency domain, the excitation signal can be regarded as | a sampled version of a truly white signal. | | If analyze 30ms pre-emphasized, voiced speech segments | I can estimate the parameters of the V quite well depending | on T. The larger T is the better. | | However, unvoiced speech is modelled as s[k]=u[k]V[z]R[z] | where u[k] is random noise. This model is a zero/pole model. | I can justify using LPC analysis by saying that this model which | has P poles can be approximated by an all-pole model with Q>P | poles. But then the model is no longer a model of the vocal tract | and the LPC estimates are no longer estimates of the coefficients | of V. | | The problem even becomes worse when LPC analysis is used | to estimate V based on noisy, pre-emphasized speech segments. | Then there is no justification for using LPC analysis to estimate | V. | | Or am I mistaken? BZZZT!!! Wrong. You'd be better filtering vocal frequencies into their invidual frequency ranges (like the human ear does) and compressing those individual waveforms (as you can filter out noise completely this way) as a group.