
STFT vs Eigen analysis

Started by Nithin September 16, 2003
Hi
In sinusoidal coding of speech, a conventional approach is to find
the formant peaks from the STFT (short-time Fourier transform). If I
use the eigen analysis method to estimate the sinusoids (frequencies),
would it perform better? I would like to know how it would
perform in terms of both time complexity and accuracy.

I would also like to know how it would perform during noisy segments
of speech.
In my simulation it did not do well for noisy segments of speech
(or unvoiced sounds). I would like to know the reason for that.

Regards
Nithin
nitin_hsn@yahoo.com (Nithin) wrote in message news:<96e5ea15.0309161158.15267f37@posting.google.com>...
> Hi
> In sinusoidal coding of speech, a conventional approach is to find
> the formant peaks from the STFT (short-time Fourier transform). If I
> use the eigen analysis method to estimate the sinusoids (frequencies),
> would it perform better? I would like to know how it would
> perform in terms of both time complexity and accuracy.
>
> I would also like to know how it would perform during noisy segments
> of speech.
> In my simulation it did not do well for noisy segments of speech
> (or unvoiced sounds). I would like to know the reason for that.

Estimating sinusoidals would not be expected to perform very well in speech
processing. The reason is that few, if any, speech segments resemble pure
sines (i.e. relatively few, very narrow lines). Traditional AR-type methods
that represent broad-band features in the spectrum would be expected to
work better.

Rune
allnor@tele.ntnu.no (Rune Allnor) wrote in message news:<f56893ae.0309170943.26cb9911@posting.google.com>...
> nitin_hsn@yahoo.com (Nithin) wrote in message news:<96e5ea15.0309161158.15267f37@posting.google.com>...
> > Hi
> > In sinusoidal coding of speech, a conventional approach is to find
> > the formant peaks from the STFT (short-time Fourier transform). If I
> > use the eigen analysis method to estimate the sinusoids (frequencies),
> > would it perform better? I would like to know how it would
> > perform in terms of both time complexity and accuracy.
> >
> > I would also like to know how it would perform during noisy segments
> > of speech.
> > In my simulation it did not do well for noisy segments of speech
> > (or unvoiced sounds). I would like to know the reason for that.
>
> Estimating sinusoidals would not be expected to perform very well in speech
> processing. The reason is that few, if any, speech segments resemble pure
> sines (i.e. relatively few, very narrow lines). Traditional AR-type methods
> that represent broad-band features in the spectrum would be expected to
> work better.
>
> Rune

Hi Rune

Thanks for your reply. But I was specifically referring to the
Sinusoidal coding of speech (or Sinusoidal Transform Coding), which is
a low bit rate speech coding technique. With respect to that, I wanted
to know if identifying the peak formants from the STFT would work
better than the eigen analysis method or not, both in terms of speed
and accuracy. One way to identify the formants using the STFT is to
sample the STFT at the frequencies of the sinusoid peaks (formant
frequencies).

Thanks
-Nithin
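
A minimal sketch of the STFT peak-picking step described above, for a single frame: window the frame, take a zero-padded DFT, and keep the largest local maxima of the magnitude spectrum. The test signal, the sample rate, the Hann window, the -40 dB floor and all function names are assumptions made for illustration; the thread does not specify them, and a practical sinusoidal coder would add peak interpolation and frame-to-frame tracking.

```python
import cmath
import math


def hann(n_samples):
    """Periodic Hann window of the given length."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_samples)
            for n in range(n_samples)]


def dft_magnitudes(frame, n_fft):
    """Magnitudes of the zero-padded DFT for bins 0..n_fft/2 (direct sum)."""
    mags = []
    for k in range(n_fft // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / n_fft)
                  for n, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def pick_peaks(mags, max_peaks, floor_db=-40.0):
    """Bins of the largest local maxima that lie above a relative floor."""
    floor = max(mags) * 10 ** (floor_db / 20)
    candidates = [k for k in range(1, len(mags) - 1)
                  if mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]
                  and mags[k] > floor]
    candidates.sort(key=lambda k: mags[k], reverse=True)
    return sorted(candidates[:max_peaks])


# Hypothetical test frame: three stationary sinusoids, no noise.
fs = 8000.0
n_samples, n_fft = 256, 1024
true_freqs = [500.0, 1000.0, 1500.0]
amps = [1.0, 0.5, 0.25]
window = hann(n_samples)
frame = [window[n] * sum(a * math.cos(2 * math.pi * f * n / fs)
                         for a, f in zip(amps, true_freqs))
         for n in range(n_samples)]

peak_bins = pick_peaks(dft_magnitudes(frame, n_fft), max_peaks=3)
est_freqs = [k * fs / n_fft for k in peak_bins]
print(est_freqs)
```

The result depends on max_peaks matching the true number of components: if it is set too high, the extra picks can land on window sidelobes rather than on real sinusoids.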
Hi.

Nithin> Hi Rune Thanks for your reply. But I was specifically
Nithin> referring to the Sinusoidal coding of speech (or Sinusoidal
Nithin> Transform Coding) which is a low bit rate speech coding
Nithin> technique. With respect to that I wanted to know if
Nithin> identifying the peak formants from STFT would work better than
Nithin> eigen analysis method or not, both in terms of speed and
Nithin> accuracy.

Subspace estimation techniques may perform very well in sinusoidal
audio modeling and coding (it has been done and papers have been
published on this). Psychoacoustics can be applied in both
unitary ESPRIT (constant-amplitude sinusoids) and ESPRIT (damped
sinusoids). MUSIC is very sensitive to the choice of the number of
sinusoids and may give you spurious peaks.

Nithin> one way to identify the formants using STFT is to sample the
Nithin> STFT at frequencies of the sinusoid peaks (formant
Nithin> frequencies).

The big problem with the STFT approach is picking the right number of
sinusoids so you don't end up picking sidelobes of the window. More
refined approaches such as matching pursuit find one sinusoid at a
time and subtract the contribution of that sinusoid from the
signal. This procedure is repeated until some convergence
criterion is met. Matching pursuit can be seen as an iterative
approximation to the nonlinear least-squares frequency estimator,
which can be shown to have a lot of nice properties.
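
The find-one-and-subtract loop can be sketched as follows. This is a simplified illustration, not a coder-grade implementation: the dictionary is restricted to unit-norm complex exponentials on a uniform frequency grid, the stopping rule is a fixed iteration count rather than a convergence criterion, and the function name, grid size and test signal are all assumptions.

```python
import cmath
import math


def matching_pursuit(signal, n_grid, n_iter):
    """Greedy pursuit over unit-norm complex exponentials on a uniform grid.

    Returns a list of (frequency in cycles/sample, complex amplitude)
    pairs and the residual left after n_iter subtractions.
    """
    n = len(signal)
    scale = 1.0 / math.sqrt(n)  # makes every atom unit-norm
    residual = [complex(x) for x in signal]
    found = []
    for _ in range(n_iter):
        best_k, best_coef = 0, 0j
        for k in range(n_grid):
            # Inner product of the residual with the k-th atom.
            coef = scale * sum(r * cmath.exp(-2j * math.pi * k * m / n_grid)
                               for m, r in enumerate(residual))
            if abs(coef) > abs(best_coef):
                best_k, best_coef = k, coef
        # Subtract the projection onto the selected atom.
        for m in range(n):
            residual[m] -= best_coef * scale * cmath.exp(
                2j * math.pi * best_k * m / n_grid)
        found.append((best_k / n_grid, best_coef * scale))
    return found, residual


# Hypothetical test signal: two complex sinusoids that lie on the grid.
n = 64
signal = [1.0 * cmath.exp(2j * math.pi * 0.1 * m)
          + 0.5 * cmath.exp(2j * math.pi * 0.3 * m) for m in range(n)]
found, residual = matching_pursuit(signal, n_grid=200, n_iter=2)
signal_energy = sum(abs(x) ** 2 for x in signal)
residual_energy = sum(abs(r) ** 2 for r in residual)
for freq, amp in found:
    print(round(freq, 3), round(abs(amp), 3))
```

Because each iteration only needs an inner product and a subtraction, the plain energy criterion used here could in principle be swapped for another distortion measure, which is the flexibility the analysis-by-synthesis view refers to.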

In the end, I think, the choice of method may depend on computational
complexity and how easily psychoacoustics can be accounted for in the
method. Matching pursuit and its derivatives are analysis-by-synthesis
methods and can work with many different distortion measures, whereas
the subspace methods are closed-form solutions, where it may be more
difficult to do so.

-- 
/Mads (http://kom.auc.dk/~mgc)
christensen@nospam.ieee.org (Mads G. Christensen) wrote in message news:<wky7k46e8na.fsf@leo.kom.auc.dk>...
> Hi.
>
> Nithin> Hi Rune Thanks for your reply. But I was specifically
> Nithin> referring to the Sinusoidal coding of speech (or Sinusoidal
> Nithin> Transform Coding) which is a low bit rate speech coding
> Nithin> technique. With respect to that I wanted to know if
> Nithin> identifying the peak formants from STFT would work better than
> Nithin> eigen analysis method or not, both in terms of speed and
> Nithin> accuracy.
>
> Subspace estimation techniques may perform very well in sinusoidal
> audio modeling and coding (it has been done and papers have been
> published on this). Psychoacoustics can be applied in both
> unitary ESPRIT (constant-amplitude sinusoids) and ESPRIT (damped
> sinusoids).

Interesting. Do you have references on this?

> MUSIC is very sensitive to the choice of the number of
> sinusoids and may give you spurious peaks.

In good implementations MUSIC (and ESPRIT) estimate the number of
sinusoidals present.

> Nithin> one way to identify the formants using STFT is to sample the
> Nithin> STFT at frequencies of the sinusoid peaks (formant
> Nithin> frequencies).
>
> The big problem with the STFT approach is picking the right number of
> sinusoids so you don't end up picking sidelobes of the window. More
> refined approaches such as matching pursuit find one sinusoid at a
> time and subtract the contribution of that sinusoid from the
> signal. This procedure is repeated until some convergence
> criterion is met. Matching pursuit can be seen as an iterative
> approximation to the nonlinear least-squares frequency estimator,
> which can be shown to have a lot of nice properties.
>
> In the end, I think, the choice of method may depend on computational
> complexity and how easily psychoacoustics can be accounted for in the
> method. Matching pursuit and its derivatives are analysis-by-synthesis
> methods and can work with many different distortion measures, whereas
> the subspace methods are closed-form solutions, where it may be more
> difficult to do so.

Eh... do you mean that MUSIC is a closed-form sinusoidal estimator? I
would be very interested in seeing a reference on this.

Rune
Hi Rune.

>> Subspace estimation techniques may perform very well in sinusoidal
>> audio modeling and coding (it has been done and papers have been
>> published on this). Psychoacoustics can be applied in both unitary
>> ESPRIT (constant-amplitude sinusoids) and ESPRIT (damped
>> sinusoids).

Rune> Interesting. Do you have references on this?

Sure, here are some:

J. Jensen, R. Heusdens and S. H. Jensen, "A Perceptual Subspace Method
for Sinusoidal Speech and Audio Modeling", Proc. IEEE International
Conference on Acoustics, Speech and Signal Processing, April 2003.

Renat Vafin, Richard Heusdens, Steven van de Par, W. Bastiaan Kleijn,
"Improved modeling of audio signals by modifying transient locations",
Proc. 2001 IEEE Workshop on Applications of Signal Processing to Audio
and Acoustics, pp. 143-146, October 2001.

Jesper Jensen, Søren Holdt Jensen, Egon Hansen, "Exponential sinusoidal
modeling of transitional speech segments", Proc. ICASSP'99,
pp. 473-476, March 1999.

Joost Nieuwenhuijse, Richard Heusdens, Ed F. Deprettere, "Robust
exponential modeling of audio signals", Proc. ICASSP'98,
pp. 3581-3584, May 1998.

>> MUSIC is very sensitive to the choice of the number of sinusoids
>> and may give you spurious peaks.

Rune> In good implementations MUSIC (and ESPRIT) estimate the number
Rune> of sinusoidals present.

Sure, but compared to the use of iterative approximations to NLS such
as matching pursuit, the process of finding the right order is
cumbersome.

Rune> Eh... do you mean that MUSIC is a closed-form sinusoidal
Rune> estimator? I would be very interested in seeing a reference on
Rune> this.

Ohh, you're right, I was thinking of root MUSIC. Anyway, (root-)
MUSIC has not received much attention in sinusoidal audio modeling and
coding due to its sensitivity to the choice of order compared to other
methods.

-- 
/Mads (http://kom.auc.dk/~mgc)
Hi Rune.

Rune> Estimating sinusoidals would not be expected to perform very
Rune> well in speech processing. The reason is that few, if any,
Rune> speech segments resemble pure sines (i.e. relatively few, very
Rune> narrow lines). Traditional AR-type methods that represent
Rune> broad-band features in the spectrum would be expected to work
Rune> better.

In the other posts, I just commented on the use of subspace
techniques. Sinusoidal modeling of both speech and audio for low
bit-rate coding applications has received much attention in research
in recent years and the references are numerous (if you are interested
I can name the most important ones). The reason is simple:
overcomplete signal decompositions based on sinusoidal bases converge
much faster than transforms do in terms of rate-distortion. From a
physical point of view, the use of sinusoidal models is also
well-founded. Voiced (periodic) speech is indeed well modeled by a
finite sum of sinusoids, whereas the model works less well for
noise-like signals such as unvoiced speech, for exactly the reason you
give.
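
A rough way to see the voiced/unvoiced difference numerically is to ask how much of a frame's energy a handful of sinusoids can capture. The sketch below uses two synthetic stand-ins, which are assumptions and not real speech: a harmonic frame whose partials sit exactly on DFT bins, and a white-noise frame. The function name and the choice of five components are likewise illustrative.

```python
import cmath
import math
import random


def top_k_energy_fraction(frame, k):
    """Fraction of frame energy held by the k strongest one-sided DFT bins."""
    n = len(frame)
    power = []
    for b in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * b * m / n)
                  for m, x in enumerate(frame))
        # Bins other than DC and Nyquist also stand for their mirror image.
        weight = 1.0 if b in (0, n // 2) else 2.0
        power.append(weight * abs(acc) ** 2)
    return sum(sorted(power, reverse=True)[:k]) / sum(power)


n = 256
# Hypothetical "voiced" frame: five harmonics of a bin-centred fundamental.
voiced = [sum(math.cos(2 * math.pi * 8 * h * m / n) / h for h in range(1, 6))
          for m in range(n)]
# Hypothetical "unvoiced" frame: white Gaussian noise.
rng = random.Random(0)
unvoiced = [rng.gauss(0.0, 1.0) for _ in range(n)]

voiced_fraction = top_k_energy_fraction(voiced, 5)
unvoiced_fraction = top_k_energy_fraction(unvoiced, 5)
print(voiced_fraction, unvoiced_fraction)
```

Five components account for essentially all of the harmonic frame but only a small share of the noise frame, whose energy is spread over every bin.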

-- 
/Mads (http://kom.auc.dk/~mgc)
On 16 Sep 2003, Nithin wrote:

> In sinusoidal coding of speech, a conventional approach is to find
> the formant peaks from the STFT (short-time Fourier transform). If I
> use the eigen analysis method to estimate the sinusoids (frequencies),
> would it perform better? I would like to know how it would
> perform in terms of both time complexity and accuracy.

Other posters have answered your question already; here I have an
off-topic remark: the FFT is actually one kind of eigenanalysis.
Specifically, the eigenvalues of the circulant matrix generated by x
correspond to the FFT of x. Here is a concrete example:

octave:1> sort(abs(fft([1;2;3;4;5])))
ans =

   2.6287
   2.6287
   4.2533
   4.2533
  15.0000

octave:2> sort(abs(eig([1 2 3 4 5;2 3 4 5 1;3 4 5 1 2;4 5 1 2 3;5 1 2 3 4])))
ans =

   2.6287
   2.6287
   4.2533
   4.2533
  15.0000

Tak-Shing
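
The relation can also be checked directly, without an eigenvalue solver, by verifying that each DFT basis vector is an eigenvector of the circulant matrix with the corresponding DFT coefficient as its eigenvalue. The sketch below is an illustration with assumed helper names. It builds the standard circulant (first column x, each column shifted down by one); the matrix typed into Octave above is the symmetric, left-shifted variant, whose eigenvalues are real and agree with the FFT in magnitude only, which is why the transcript compares absolute values.

```python
import cmath
import math


def dft(x):
    """Direct-sum DFT of a short sequence."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * math.pi * m * k / n) for m in range(n))
            for k in range(n)]


def circulant(x):
    """Circulant matrix whose first column is x: C[i][j] = x[(i - j) mod n]."""
    n = len(x)
    return [[x[(i - j) % n] for j in range(n)] for i in range(n)]


x = [1, 2, 3, 4, 5]
n = len(x)
C = circulant(x)
X = dft(x)

max_residual = 0.0
for k in range(n):
    # k-th DFT basis vector.
    v = [cmath.exp(2j * math.pi * j * k / n) for j in range(n)]
    Cv = [sum(C[i][j] * v[j] for j in range(n)) for i in range(n)]
    # If v is an eigenvector with eigenvalue X[k], then C v - X[k] v = 0.
    max_residual = max(max_residual,
                       max(abs(Cv[i] - X[k] * v[i]) for i in range(n)))

magnitudes = sorted(abs(value) for value in X)
print([round(m, 4) for m in magnitudes], max_residual)
```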
christensen@nospam.ieee.org (Mads G. Christensen) wrote in message news:<ur82eghkw.fsf@nospam.ieee.org>...
> Hi Rune.
>
> >> Subspace estimation techniques may perform very well in sinusoidal
> >> audio modeling and coding (it has been done and papers have been
> >> published on this). Psychoacoustics can be applied in both unitary
> >> ESPRIT (constant-amplitude sinusoids) and ESPRIT (damped
> >> sinusoids).
>
> Rune> Interesting. Do you have references on this?
>
> Sure, here are some:

[..snipped..]

Thanks. I'll try to get these.

> >> MUSIC is very sensitive to the choice of the number of sinusoids
> >> and may give you spurious peaks.
>
> Rune> In good implementations MUSIC (and ESPRIT) estimate the number
> Rune> of sinusoidals present.
>
> Sure, but compared to the use of iterative approximations to NLS such
> as matching pursuit, the process of finding the right order is
> cumbersome.

Estimating the order need not be cumbersome at all. The naive order
estimators come as closed-form analytical expressions and cost
virtually nothing (in terms of computation) compared to the
eigenvector decompositions or singular value decompositions that both
MUSIC and ESPRIT rely on. I saw a paper a few years back on more
elaborate order estimators that were based on maximum-likelihood type
arguments, and these may very well be expensive. I haven't tried any
of those, though.
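
As an illustration of an order estimate that is a closed-form function of the eigenvalues, here is a sketch of the MDL criterion of Wax and Kailath. Whether this is the "naive" estimator meant above is not stated in the thread, so the choice of criterion is an assumption, as are the function name and the example eigenvalues. The point it shows is that a single eigendecomposition suffices: every candidate order is scored from the same list of eigenvalues.

```python
import math


def mdl_order(eigenvalues, n_snapshots):
    """Model order that minimises the MDL criterion over k = 0..p-1.

    eigenvalues: positive eigenvalues of the p x p sample covariance matrix.
    n_snapshots: number of data vectors used to form that matrix.
    """
    lam = sorted(eigenvalues, reverse=True)
    p = len(lam)
    best_k, best_score = 0, float("inf")
    for k in range(p):
        tail = lam[k:]  # eigenvalues attributed to noise under order k
        m = len(tail)
        log_geometric_mean = sum(math.log(v) for v in tail) / m
        log_arithmetic_mean = math.log(sum(tail) / m)
        # Small when the tail is flat, i.e. looks like white noise.
        fit = -n_snapshots * m * (log_geometric_mean - log_arithmetic_mean)
        # Grows with the number of free parameters at order k.
        penalty = 0.5 * k * (2 * p - k) * math.log(n_snapshots)
        if fit + penalty < best_score:
            best_k, best_score = k, fit + penalty
    return best_k


# Hypothetical eigenvalues: two strong components over a flat noise floor.
order = mdl_order([10.0, 8.0, 1.01, 1.0, 1.0, 0.99], n_snapshots=100)
print(order)
```

The loop costs a few logarithms per candidate order, which is negligible next to the decomposition that produced the eigenvalues.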

> Rune> Eh... do you mean that MUSIC is a closed-form sinusoidal
> Rune> estimator? I would be very interested in seeing a reference on
> Rune> this.
>
> Ohh, you're right, I was thinking of root MUSIC. Anyway, (root-)
> MUSIC has not received much attention in sinusoidal audio modeling and
> coding due to its sensitivity to the choice of order compared to other
> methods.

If you are interested in parametric models based on damped or undamped
sinusoidals (in general, no applications in psychoacoustics), take a
look at the book

Marple: "Digital Spectral Analysis with Applications in C, FORTRAN,
and MATLAB", Prentice-Hall, 2003.

It's a 2nd edition, due to be released these days (according to
www.amazon.com), of a book that I found very useful for these types of
problems.

If you work with problems where root MUSIC is an alternative method,
you may find the Tufts/Kumaresan method for linear prediction
interesting. The method is reported in

Tufts & Kumaresan: "Estimating Frequencies of Multiple Sinusoids:
Making Linear Prediction Perform like Maximum Likelihood",
Proc. IEEE, vol 70 no 9, pp 975-989, 1982.

The implementation is virtually identical to root MUSIC, except for
two tiny details. But those details make all the difference. And it's
trivial to include (naive) order estimators.

Rune
Hi Rune.

Rune> Estimating the order need not be cumbersome at all. The naive
Rune> order estimators come as closed-form analytical expressions and
Rune> cost virtually nothing (in terms of computation) compared to
Rune> the eigenvector decompositions or singular value
Rune> decompositions that both MUSIC and ESPRIT rely on. I saw a
Rune> paper a few years back on more elaborate order estimators that
Rune> were based on maximum-likelihood type arguments, and these may
Rune> very well be expensive. I haven't tried any of those, though.

Really? I thought (at least in the methods I've heard about) that you 
still needed to do the actual decomposition for different orders and
then combine the error with a penalty function for higher orders. Do
you have some references on this?

Another way of getting an appropriate estimate is to estimate jointly 
over a set of sinusoids and an AR process.

Rune> If you are interested in parametric models based on damped or
Rune> undamped sinusoidals (in general, no applications in
Rune> psychoacoustics), take a look at the book

Rune> Marple: "Digital Spectral Analysis with Applications in C,
Rune> FORTRAN, and MATLAB", Prentice-Hall, 2003.

Cool, thanks. I'll have a look at that.

Rune> Tufts & Kumaresan: "Estimating Frequencies of Multiple
Rune> Sinusoids: Making Linear Prediction Perform like Maximum
Rune> Likelihood" Proc. IEEE, vol 70 no 9, pp 975-989, 1982.

Hmmm, I think I have that one somewhere in my office...

-- 
/Mads (http://kom.auc.dk/~mgc)