Hi

In sinusoidal coding of speech, a conventional approach is to find the formant peaks from the STFT (short-time Fourier transform). If I use the eigenanalysis method to estimate the sinusoids (frequencies), would it perform better? I would like to know how it would perform in terms of both time complexity and accuracy.

I would also like to know how it would perform during noisy segments of speech. In my simulation it did not do well for noisy segments of speech (or unvoiced sounds), and I would like to know the reason for that.

Regards
Nithin
STFT vs Eigen analysis
Started by ●September 16, 2003
Reply by ●September 17, 2003
nitin_hsn@yahoo.com (Nithin) wrote in message news:<96e5ea15.0309161158.15267f37@posting.google.com>...
> Hi
> In sinusoidal coding of speech a conventional approach used is to find
> out the formant peaks from STFT (short time fourier transform). If i
> use the eigen analysis method to estimate the sinusoids (frequencies)
> would it be better in performance? I would like to know how would it
> perform both in terms time complexity and accuracy.
>
> I also would like to know how would it perform during noisy segments
> of speech.
> Well, in my simulation it did not do well for noisy segments of speech
> (or unvoiced sounds). I would like to know the reason for that.

Estimating sinusoids would not be expected to perform very well in speech processing. The reason is that few, if any, speech segments resemble pure sines (i.e. relatively few, very narrow spectral lines). Traditional AR-type methods, which represent broad-band features in the spectrum, would be expected to work better.

Rune
Reply by ●September 17, 2003
allnor@tele.ntnu.no (Rune Allnor) wrote in message news:<f56893ae.0309170943.26cb9911@posting.google.com>...
> nitin_hsn@yahoo.com (Nithin) wrote in message news:<96e5ea15.0309161158.15267f37@posting.google.com>...
> > Hi
> > In sinusoidal coding of speech a conventional approach used is to find
> > out the formant peaks from STFT (short time fourier transform). If i
> > use the eigen analysis method to estimate the sinusoids (frequencies)
> > would it be better in performance? I would like to know how would it
> > perform both in terms time complexity and accuracy.
> >
> > I also would like to know how would it perform during noisy segments
> > of speech.
> > Well, in my simulation it did not do well for noisy segments of speech
> > (or unvoiced sounds). I would like to know the reason for that.
>
> Estimating sinusoidals would not be expected to perform very well in speech
> processing. The reason is that few, if any, speech segments resemble pure
> sines (i.e. relatively few, very narrow lines). Traditional AR-type methods
> that represent broad-band features in the spectrum would be expected to
> work better.
>
> Rune

Hi Rune,

Thanks for your reply. But I was specifically referring to sinusoidal coding of speech (Sinusoidal Transform Coding), which is a low-bit-rate speech coding technique. With respect to that, I wanted to know whether identifying the formant peaks from the STFT would work better than the eigenanalysis method, in terms of both speed and accuracy. One way to identify the formants using the STFT is to sample the STFT at the frequencies of the sinusoid peaks (formant frequencies).

Thanks
-Nithin
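For readers who want something concrete, the STFT peak-picking step might look roughly like the sketch below for a single frame. It is only an illustration under stated assumptions, not the procedure of any particular coder: the Hann window, the naive O(N^2) DFT, the helper names `dft_magnitudes` and `pick_peaks`, and the `rel_threshold` parameter are all choices made here for the example.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitudes of the DFT of a Hann-windowed frame, for bins 0..N/2."""
    n = len(frame)
    # Periodic Hann window (an assumption; any analysis window could be used).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]
    x = [s * w for s, w in zip(frame, window)]
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        mags.append(abs(acc))
    return mags

def pick_peaks(mags, rel_threshold=0.1):
    """Bins that are local maxima and at least rel_threshold * max(mags)."""
    floor = rel_threshold * max(mags)
    return [k for k in range(1, len(mags) - 1)
            if mags[k] > mags[k - 1] and mags[k] >= mags[k + 1] and mags[k] >= floor]

# Toy frame: two sinusoids that fall exactly on DFT bins 10 and 25.
n = 128
frame = [math.sin(2 * math.pi * 10 * i / n) + 0.5 * math.sin(2 * math.pi * 25 * i / n)
         for i in range(n)]
peaks = pick_peaks(dft_magnitudes(frame))
print(peaks)
```

The threshold is the fragile part: set it too low and window sidelobes or noise bins are accepted as sinusoids, set it too high and weak partials are lost.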
Reply by ●September 18, 2003
Hi.

Nithin> Hi Rune Thanks for your reply. But i was specifically
Nithin> referring to the Sinusoidal coding of speech (or Sinusoidal
Nithin> Transform Coding) which is a low bit rate speech coding
Nithin> technique. With respect to that i wanted to know if
Nithin> identifying the peak formants from STFT would work better than
Nithin> eigen analysis method or not, both in terms of speed and
Nithin> accuracy..

Subspace estimation techniques may perform very well in sinusoidal audio modeling and coding (it has been done and papers have been published on this). Psychoacoustics can be applied in both unitary ESPRIT (damped sinusoids) and ESPRIT (constant amplitude sinusoids). MUSIC is very sensitive to the choice of the number of sinusoids and may give you spurious peaks.

Nithin> one way to identify the formants using STFT is to sample the
Nithin> STFT at frequencies of the sinusoid peaks (formant
Nithin> frequencies).

The big problem with the STFT approach is picking the right number of sinusoids so that you don't end up picking sidelobes of the window. More refined approaches such as matching pursuit find one sinusoid at a time and subtract the contribution of that sinusoid from the signal. This procedure is repeated until some convergence criterion is met. Matching pursuit can be seen as an iterative approximation to the nonlinear least-squares frequency estimator, which can be shown to have a lot of nice properties.

In the end, I think, the choice of method may depend on computational complexity and how easily psychoacoustics can be accounted for in the method. Matching pursuit and its derivatives are analysis-by-synthesis methods and can work with many different distortion measures, whereas the subspace methods are closed-form solutions, where it may be more difficult to do so.

--
/Mads (http://kom.auc.dk/~mgc)
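The "find one sinusoid, subtract it, repeat" loop can be sketched as follows. This is a deliberately simplified illustration rather than a published algorithm: frequencies are restricted to the DFT grid, the stopping rule is a plain residual-energy ratio instead of a perceptual distortion measure, and the function name `matching_pursuit` and its parameters `max_atoms` and `stop_ratio` are assumptions made for the example.

```python
import cmath
import math

def matching_pursuit(signal, max_atoms=10, stop_ratio=1e-3):
    """Greedy sinusoidal decomposition on the DFT frequency grid.

    Returns a list of (bin, amplitude, phase) tuples and the final residual.
    """
    n = len(signal)
    residual = list(signal)
    energy0 = sum(s * s for s in signal)
    atoms = []
    for _ in range(max_atoms):
        # Correlate the residual with every on-grid sinusoid (bins 1..N/2-1).
        best_k, best_c = None, 0j
        for k in range(1, n // 2):
            c = (2.0 / n) * sum(residual[i] * cmath.exp(-2j * math.pi * k * i / n)
                                for i in range(n))
            if abs(c) > abs(best_c):
                best_k, best_c = k, c
        if best_k is None:
            break  # residual is exactly zero
        # Subtract the least-squares fit of that sinusoid from the residual.
        residual = [residual[i] - (best_c * cmath.exp(2j * math.pi * best_k * i / n)).real
                    for i in range(n)]
        atoms.append((best_k, abs(best_c), cmath.phase(best_c)))
        # Stop once the residual energy is small relative to the input energy.
        if sum(r * r for r in residual) <= stop_ratio * energy0:
            break
    return atoms, residual

# Toy signal: two on-grid sinusoids with different amplitudes and phases.
n = 64
x = [1.0 * math.cos(2 * math.pi * 5 * i / n + 0.3)
     + 0.4 * math.cos(2 * math.pi * 12 * i / n - 1.0) for i in range(n)]
atoms, residual = matching_pursuit(x)
for k, amp, phase in atoms:
    print(k, round(amp, 3), round(phase, 3))
```

Because the loop stops on its own criterion, the number of sinusoids does not have to be fixed in advance, and replacing the energy ratio with another distortion measure only changes the selection and stopping tests.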
Reply by ●September 18, 2003
christensen@nospam.ieee.org (Mads G. Christensen) wrote in message news:<wky7k46e8na.fsf@leo.kom.auc.dk>...
> Hi.
>
> Nithin> Hi Rune Thanks for your reply. But i was specifically
> Nithin> referring to the Sinusoidal coding of speech (or Sinusoidal
> Nithin> Transform Coding) which is a low bit rate speech coding
> Nithin> technique. With respect to that i wanted to know if
> Nithin> identifying the peak formants from STFT would work better than
> Nithin> eigen analysis method or not, both in terms of speed and
> Nithin> accuracy..
>
> Subspace estimation techniques may perform very well in sinusoidal
> audio modeling and coding (it has been done and papers have been
> published on this). Psychoacoustics can be applied in both
> unitary ESPRIT (damped sinusoids) and ESPRIT (constant amplitude
> sinusoids).

Interesting. Do you have references on this?

> MUSIC is very sensitive to the choice of the number of
> sinusoids and may give you spurious peaks.

In good implementations, MUSIC (and ESPRIT) estimate the number of sinusoids present.

> Nithin> one way to identify the formants using STFT is to sample the
> Nithin> STFT at frequencies of the sinusoid peaks (formant
> Nithin> frequencies).
>
> The big problem with the STFT approach is picking the right number of
> sinusoids so you don't end up with picking sidelobes of windows. More
> refined approaches such as matching pursuit finds one sinusoid at the
> time and subtracts the contribution of that sinusoid from the
> signal. This procedure is then carried out until some convergence
> criterion is met. Matching pursuit can be seen as an iterative
> approximation to the nonlinear least-squares frequency estimator,
> which can be shown to have a lot of nice properties.
>
> In the end, I think, the choice of method may depend on computational
> complexity and how easily psychoacoustics can be accounted for in the
> method. Matching pursuit and its derivatives are analysis-by-synthesis
> methods and can work with many different distortion measures, whereas
> the subspace methods are closed-form solutions, where it may be more
> difficult to do so.

Eh... do you mean that MUSIC is a closed-form sinusoidal estimator? I would be very interested in seeing a reference on this.

Rune
Reply by ●September 18, 2003
Hi Rune.

>> Subspace estimation techniques may perform very well in sinusoidal
>> audio modeling and coding (it has been done and papers have been
>> published on this). Psychoacoustics can be applied in both unitary
>> ESPRIT (damped sinusoids) and ESPRIT (constant amplitude
>> sinusoids).

Rune> Interesting. Do you have references on this?

Sure, here are some:

J. Jensen, R. Heusdens and S. H. Jensen, "A Perceptual Subspace Method for Sinusoidal Speech and Audio Modeling", Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, April 2003.

Renat Vafin, Richard Heusdens, Steven van de Par, W. Bastiaan Kleijn, "Improved modeling of audio signals by modifying transient locations", Proc. 2001 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 143-146, October 2001.

Jesper Jensen, Søren Holdt Jensen, Egon Hansen, "Exponential sinusoidal modeling of transitional speech segments", Proc. ICASSP'99, pp. 473-476, March 1999.

Joost Nieuwenhuijse, Richard Heusdens, Ed F. Deprettere, "Robust exponential modeling of audio signals", Proc. ICASSP'98, pp. 3581-3584, May 1998.

>> MUSIC is very sensitive to the choice of the number of sinusoids
>> and may give you spurious peaks.

Rune> In good implementations MUSIC (and ESPRIT) estimate the number
Rune> of sinusoidals present.

Sure, but compared to the use of iterative approximations to NLS such as matching pursuit, the process of finding the right order is cumbersome.

Rune> Eh... do you mean that MUSIC is a closed-form sinusoidal
Rune> estimator? I would be very interested in seeing a reference on
Rune> this.

Ohh, you're right, I was thinking of root MUSIC. Anyway, (root-)MUSIC has not received much attention in sinusoidal audio modeling and coding, because it is more sensitive to the choice of order than other methods.

--
/Mads (http://kom.auc.dk/~mgc)
Reply by ●September 18, 2003
Hi Rune.

Rune> Estimating sinusoidals would not be expected to perform very
Rune> well in speech processing. The reason is that few, if any,
Rune> speech segments resemble pure sines (i.e. relatively few, very
Rune> narrow lines). Traditional AR-type methods that represent
Rune> broad-band features in the spectrum would be expected to work
Rune> better.

In the other posts, I only commented on the use of subspace techniques. Sinusoidal modeling of both speech and audio for low bit-rate coding applications has received much attention in research in recent years, and the references are numerous (if you are interested I can name the most important ones). The reason is simple: overcomplete signal decompositions based on sinusoidal bases converge much faster than transforms do in terms of rate-distortion.

From a physical point of view, the use of sinusoidal models is also well-founded. Voiced (periodic) speech is indeed well modeled by a finite sum of sinusoids, whereas the model works less well for noise-like signals such as unvoiced speech, for exactly the reason you give.

--
/Mads (http://kom.auc.dk/~mgc)
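The voiced/unvoiced difference can be illustrated with a toy experiment: measure how much of a frame's energy is captured by its few strongest spectral lines. The sketch below uses synthetic stand-ins, not real speech. The "voiced" frame is a sum of five on-grid harmonics and the "unvoiced" frame is white Gaussian noise; both choices, as well as the helper name `top_k_energy_fraction`, are assumptions made for the example.

```python
import cmath
import math
import random

def top_k_energy_fraction(frame, k):
    """Fraction of the energy in bins 1..N/2-1 held by the k strongest bins."""
    n = len(frame)
    power = []
    for b in range(1, n // 2):
        c = sum(frame[i] * cmath.exp(-2j * math.pi * b * i / n) for i in range(n))
        power.append(abs(c) ** 2)
    return sum(sorted(power, reverse=True)[:k]) / sum(power)

n = 256
# Synthetic "voiced" frame: five harmonics of a fundamental at bin 4.
voiced = [sum(math.cos(2 * math.pi * 4 * h * i / n) / h for h in range(1, 6))
          for i in range(n)]
# Synthetic "unvoiced" frame: white Gaussian noise with a fixed seed.
rng = random.Random(0)
unvoiced = [rng.gauss(0.0, 1.0) for _ in range(n)]

voiced_fraction = top_k_energy_fraction(voiced, 5)
unvoiced_fraction = top_k_energy_fraction(unvoiced, 5)
print(voiced_fraction, unvoiced_fraction)
```

Five sinusoids account for essentially all of the harmonic frame, while the noise frame spreads its energy over every bin, so a small set of sinusoids represents it poorly regardless of how accurately their frequencies are estimated.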
Reply by ●September 18, 2003
On 16 Sep 2003, Nithin wrote:
> In sinusoidal coding of speech a conventional approach used is to find
> out the formant peaks from STFT (short time fourier transform). If i
> use the eigen analysis method to estimate the sinusoids (frequencies)
> would it be better in performance? I would like to know how would it
> perform both in terms time complexity and accuracy.

Other posters have answered your question already; here I have an off-topic remark: the FFT is actually one kind of eigenanalysis. Specifically, the eigenvalues of the circulant matrix generated by x correspond to the FFT of x. Here is a concrete example:

octave:1> sort(abs(fft([1;2;3;4;5])))
ans =
   2.6287
   2.6287
   4.2533
   4.2533
  15.0000
octave:2> sort(abs(eig([1 2 3 4 5;2 3 4 5 1;3 4 5 1 2;4 5 1 2 3;5 1 2 3 4])))
ans =
   2.6287
   2.6287
   4.2533
   4.2533
  15.0000

Tak-Shing
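The same relation can be checked without an eigenvalue solver by verifying that each DFT basis vector is an eigenvector of the circulant matrix, with the corresponding DFT coefficient as its eigenvalue. The sketch below assumes the convention C[i][j] = x[(i - j) mod n], i.e. x is the first column. The matrix typed into Octave above shifts its rows the other way, which makes it symmetric with real eigenvalues, so it matches the FFT in magnitude only; that is why the `abs()` is needed there. The helper names `dft` and `circulant` are made up for this example.

```python
import cmath
import math

def dft(x):
    """Naive DFT: X[k] = sum_i x[i] * exp(-2j*pi*k*i/n)."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
            for k in range(n)]

def circulant(x):
    """Circulant matrix with first column x: C[i][j] = x[(i - j) mod n]."""
    n = len(x)
    return [[x[(i - j) % n] for j in range(n)] for i in range(n)]

x = [1, 2, 3, 4, 5]
n = len(x)
C = circulant(x)
X = dft(x)

# For every k, v_k[i] = exp(2j*pi*k*i/n) should satisfy C v_k = X[k] v_k.
max_error = 0.0
for k in range(n):
    v = [cmath.exp(2j * math.pi * k * i / n) for i in range(n)]
    Cv = [sum(C[i][j] * v[j] for j in range(n)) for i in range(n)]
    max_error = max(max_error, max(abs(Cv[i] - X[k] * v[i]) for i in range(n)))

magnitudes = sorted(abs(v) for v in X)
print(max_error)
print([round(m, 4) for m in magnitudes])
```

The eigenvectors are the same for every circulant matrix; only the eigenvalues depend on x. The subspace methods discussed in this thread instead decompose a data-dependent covariance or data matrix, so their eigenvectors are not fixed in advance.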
Reply by ●September 18, 2003
christensen@nospam.ieee.org (Mads G. Christensen) wrote in message news:<ur82eghkw.fsf@nospam.ieee.org>...
> Hi Rune.
>
> >> Subspace estimation techniques may perform very well in sinusoidal
> >> audio modeling and coding (it has been done and papers have been
> >> published on this). Psychoacoustics can be applied in both unitary
> >> ESPRIT (damped sinusoids) and ESPRIT (constant amplitude
> >> sinusoids).
>
> Rune> Interesting. Do you have references on this?
>
> Sure, here are some:

[..snipped..]

Thanks. I'll try to get these.

> >> MUSIC is very sensitive to the choice of the number of sinusoids
> >> and may give you spurious peaks.
>
> Rune> In good implementations MUSIC (and ESPRIT) estimate the number
> Rune> of sinusoidals present.
>
> Sure, but compared to the use of iterative approximations to NLS such
> as matching pursuit, the process of finding the right order is
> cumbersome.

Estimating the order need not be cumbersome at all. The naive order estimators come as closed-form analytical expressions and cost virtually nothing (in terms of computation) compared to the eigenvector decompositions or singular value decompositions that both MUSIC and ESPRIT rely on. A few years back I saw a paper on more elaborate order estimators based on maximum-likelihood-type arguments, and these may very well be expensive. I haven't tried any of those, though.

> Rune> Eh... do you mean that MUSIC is a closed-form sinusoidal
> Rune> estimator? I would be very interested in seeing a reference on
> Rune> this.
>
> Ohh, you're right, I was thinking of root MUSIC. Anyways, (root-)
> MUSIC has not received much attention in sinusoidal audio modeling and
> coding due to the sensitivity to choice of order compared to other
> methods.

If you are interested in parametric models based on damped or undamped sinusoids (in general, with no applications in psychoacoustics), take a look at the book

Marple: "Digital Spectral Analysis with Applications in C, FORTRAN, and MATLAB", Prentice-Hall, 2003.

It's a 2nd edition, due to be released these days (according to www.amazon.com), of a book that I found very useful for these types of problems.

If you work with problems where root MUSIC is an alternative method, you may find the Tufts/Kumaresan method for linear prediction interesting. The method is reported in

Tufts & Kumaresan: "Estimating Frequencies of Multiple Sinusoids: Making Linear Prediction Perform like Maximum Likelihood", Proc. IEEE, vol 70 no 9, pp 975-989, 1982.

The implementation is virtually identical to root MUSIC, except for two tiny details. But those details make all the difference. And it's trivial to include (naive) order estimators.

Rune
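As an illustration of how cheap a closed-form order estimate can be once the eigenvalues are available, here is a sketch of an MDL-style criterion evaluated on the covariance eigenvalues. Whether this is the kind of "naive" estimator meant above is an assumption; the function name `mdl_order`, the penalty term in its commonly quoted form, and the toy eigenvalues are all chosen for this example only.

```python
import math

def mdl_order(eigenvalues, num_snapshots):
    """Signal-subspace dimension that minimises an MDL-style criterion.

    For each candidate order k, the k largest eigenvalues are treated as
    signal and the remaining p - k as noise. The first term compares the
    geometric and arithmetic means of the noise eigenvalues (equal means
    give zero); the second term penalises larger k.
    """
    lam = sorted(eigenvalues, reverse=True)
    p = len(lam)
    best_k, best_score = 0, float("inf")
    for k in range(p):
        tail = lam[k:]
        m = p - k
        log_arith = math.log(sum(tail) / m)
        log_geo = sum(math.log(v) for v in tail) / m
        fit = -num_snapshots * m * (log_geo - log_arith)
        penalty = 0.5 * k * (2 * p - k) * math.log(num_snapshots)
        if fit + penalty < best_score:
            best_k, best_score = k, fit + penalty
    return best_k

# Toy eigenvalues: four large "signal" values above a nearly flat noise floor.
eigs = [50.0, 48.0, 12.0, 11.0, 1.1, 1.0, 0.95, 0.9]
order = mdl_order(eigs, num_snapshots=200)
print(order)
```

Each real sinusoid occupies two dimensions of the signal subspace, so an estimated dimension of 4 corresponds to two real sinusoids. The loop costs a handful of logarithms per candidate order, which is negligible next to the eigendecomposition that produced the eigenvalues.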
Reply by ●September 19, 2003
Hi Rune.

Rune> Estimating the order need not be cumbersome at all. The naive
Rune> order estimators come in closed form analytical expressions and
Rune> cost virtually nothing (in terms of computations) compared to
Rune> the eigen vector decompositions or singular value
Rune> decompositions that both MUSIC and ESPRIT rely on. I have seen a
Rune> paper a few years back on more elaborate order estimators that
Rune> were based on maximum likelihood type arguments, and these may
Rune> very well be expensive. I haven't tried any of those, though.

Really? I thought (at least for the methods I've heard about) that you still needed to do the actual decomposition for different orders and then combine the error with a penalty function for higher orders. Do you have some references on this? Another way of getting an appropriate estimate is to estimate jointly over a set of sinusoids and an AR process.

Rune> If you are interested in parametric models based on damped or
Rune> undamped sinusoidals (in general, no applications in
Rune> psychoacoustics), take a look at the book

Rune> Marple: "Digital Spectral Analysis with Applications in C,
Rune> FORTRAN, and MATLAB", Prentice-Hall, 2003.

Cool, thanks. I'll have a look at that.

Rune> Tufts & Kumaresan: "Estimating Frequencies of Multiple
Rune> Sinusoids: Making Linear Prediction Perform like Maximum
Rune> Likelihood" Proc. IEEE, vol 70 no 9, pp 975-989, 1982.

Hmmm, I think I have that one somewhere in my office...

--
/Mads (http://kom.auc.dk/~mgc)






