Technical discussions related to Speech Coding (all ITU and other vocoders, ACELP, CELP, AMR, etc.)
|
Hello.

I know that speech can be reconstructed from its spectrum, but what quality should be expected? How exactly should the phases of the harmonics be reconstructed? Is it possible to give every harmonic a phase different from the one in the original spectrum (so that we obtain a different waveform) and still keep the original sound quality? (I do not mean the simple cases such as signal inversion.) How much freedom do we have over the phases in this case? Is there some fixed relation between the phases that we could use when we look at the coefficients of a source-synchronized DFT?

Thank you in advance,
Stefan. |
|
|
|
It is a well-known fact that the human ear is sensitive to frequency and not to phase. However, phase distortion means that different frequencies are delayed by different amounts, and this tends to produce audible distortion. If the phase distortion stays within a certain limit, it is not a problem. The limit depends on the application: hi-fi systems require as little phase distortion as possible, while communication-quality systems can take much more liberty.

This can also be observed in practice:

1. Most vocoders (LPC etc.) do not encode phase, and they tend to produce a mechanical sound.
2. Hybrid coders (CELP etc.) encode some amount of phase, so they perform relatively better.
3. Waveform coders give even better quality.

(It is not only phase, of course, but phase has an important role in making speech sound natural or mechanical.)

-Bajwa |
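As a rough numerical illustration of the point that phase distortion delays different frequencies by different amounts, the sketch below evaluates a first-order all-pass filter. The filter, its coefficient `a = 0.5`, and the function names are illustrative assumptions and do not come from the post. The magnitude response is 1 at every frequency, yet the group delay (the negative derivative of phase with respect to frequency) ranges from about 0.33 samples at low frequencies to about 3 samples near the Nyquist frequency.

```python
import numpy as np

def allpass_response(a, n_freqs=512):
    """Frequency response of the all-pass H(z) = (a + z^-1) / (1 + a z^-1)."""
    # Stay slightly inside (0, pi) so the numerical derivative is well behaved.
    w = np.linspace(0.01, np.pi - 0.01, n_freqs)
    z_inv = np.exp(-1j * w)
    h = (a + z_inv) / (1.0 + a * z_inv)
    return w, h

def group_delay(w, h):
    """Group delay in samples: minus the derivative of the unwrapped phase."""
    phase = np.unwrap(np.angle(h))
    return -np.gradient(phase, w)

a = 0.5  # illustrative coefficient (assumption); |a| < 1 keeps the filter stable
w, h = allpass_response(a)
magnitude = np.abs(h)
delay = group_delay(w, h)

print("magnitude range:", magnitude.min(), magnitude.max())
print("group delay range (samples):", delay.min(), delay.max())
```

A filter like this changes only the phase, so any audible difference it causes would be due to phase alone. Whether a delay spread of a few samples is audible is a perceptual question that this sketch does not answer.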
|
|
|
This is an interesting issue, so let me share some of my own experience (please be patient and read to the end).

Read these references on signal reconstruction from magnitude only or phase only:

1) A. Oppenheim and J. Lim, "The importance of phase in signals", Proceedings of the IEEE, Vol. 69, No. 5, May 1981.
2) B. Yegnanarayana et al., "Significance of group delay functions in signal reconstruction from spectral magnitude or phase", IEEE Trans. ASSP, Vol. ASSP-32, No. 3, June 1984.

(We can reconstruct a signal from its magnitude spectrum alone if it satisfies certain conditions, for example if it is a minimum-phase signal. Similarly, we can reconstruct a signal from its phase spectrum or group delay alone under certain conditions.)

Many researchers have studied the importance of phase for the quality of the reconstructed signal. My own experiments show that the Fourier phase is important for retaining natural-quality speech. One simple experiment is to keep the magnitude spectrum, replace the true phase with some other phase, and study the quality of the reconstructed signal. I have done this using the LP residual, and I have found that for speech the phase of the LP residual is important for preserving naturalness. For unvoiced speech, random phase is a good model with almost no effect on quality. For voiced and mixed-voicing speech, however, random phase is not sufficient and is not a good model. (For non-speech sounds such as background acoustic noise, see my paper: K. El-Maleh and P. Kabal, "Natural-quality background noise coding using residual substitution", EuroSpeech 99, www.tsp.ece.mcgill.ca.)

The main reason is that for voiced speech, and for any structured sound, the sequence of consecutive acoustic events produces the phase pattern. For example, in Waveform Interpolation it is important to preserve the spacing between consecutive pitch pulses (Kang and Sen, "Phase adjustment in waveform interpolation", ICASSP 99).
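The phase-replacement experiment described above can be sketched roughly as follows. This is a minimal illustration and not the actual experimental setup: a synthetic pulse train stands in for a voiced LP residual, and the frame length, pitch period, and function names are assumptions. The sketch keeps the DFT magnitude, substitutes random phase, and compares the crest factor (peak over RMS) before and after.

```python
import numpy as np

def replace_phase(frame, new_phase):
    """Keep the DFT magnitude of `frame`, substitute `new_phase` (radians per rfft bin)."""
    spectrum = np.fft.rfft(frame)
    modified = np.abs(spectrum) * np.exp(1j * new_phase)
    return np.fft.irfft(modified, n=len(frame))

def crest_factor(x):
    """Peak amplitude divided by RMS; high for pulse-like signals."""
    return np.max(np.abs(x)) / np.sqrt(np.mean(x ** 2))

n = 512       # frame length in samples (assumption)
period = 64   # pitch period in samples (assumption)

# Stand-in for a voiced LP residual: one unit pulse per pitch period.
residual = np.zeros(n)
residual[::period] = 1.0

rng = np.random.default_rng(0)
random_phase = rng.uniform(-np.pi, np.pi, n // 2 + 1)
# The DC and Nyquist bins of a real signal must stay real,
# so their phase is fixed at zero to keep the magnitude intact.
random_phase[0] = 0.0
random_phase[-1] = 0.0

scrambled = replace_phase(residual, random_phase)

same_magnitude = np.allclose(np.abs(np.fft.rfft(scrambled)),
                             np.abs(np.fft.rfft(residual)))
print("magnitude spectrum preserved:", same_magnitude)
print("crest factor, pulse train:   ", crest_factor(residual))
print("crest factor, random phase:  ", crest_factor(scrambled))
```

The magnitude spectrum is unchanged, but the energy that was concentrated in the pitch pulses is smeared across the whole period. That loss of pulse structure is consistent with the observation that random phase is a poor model for voiced speech, although the sketch says nothing about perceived quality. To listen to the effect, the modified residual would have to be passed through the LP synthesis filter.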
Remark: The long-term phase (the sequence of short-time phase spectra) is perceptually important. To read more about this, see:

1) C. Ma and D. O'Shaughnessy, "A perceptual study of source coding of Fourier phase and amplitude of the linear predictive coding residual of vowel sounds", J. Acoust. Soc. America, 95 (4), April 1994, pp. 2231-2239.
2) P. Hedelin, "Phase compensation in all-pole speech analysis", ICASSP 88, pp. 339-342.
3) O. Gautherot et al., "LPC residual phase investigation", Proc. of EuroSpeech 89, pp. 35-38.

For the CELP family of speech coders, we have to remember that they use a closed-loop (analysis-by-synthesis) search to find the excitation, that is, to model the LP residual. This waveform matching is nothing more than preserving (modeling) both the magnitude and the phase of the LP residual. Read this paper for more details: T. Ramabadran and C. Lueck, "Complexity reduction of CELP speech coders through the use of phase information", IEEE Trans. Communications, Vol. 42, No. 2/3/4, Feb./March/April 1994, pp. 248-251.

For CELP coders at and below 4 kbps, many researchers have recently proposed using an extra all-pass filter (a phase addition filter) to compensate for the insufficient modeling of phase in CELP with small codebooks. Read these papers:

1) Y. Yamaura et al., "CELP coding below 3 kbps using LPC residual phase coding", Speech Coding Workshop 97, pp. 103-104.
2) B. Cheetham et al., "All-pass excitation phase modelling for low bit-rate speech coding", 1997 IEEE Int. Symposium on Circuits and Systems, June 1997, pp. 2633-2636.

Recently, many arguments have appeared that do not support the well-known statement "the human ear is not sensitive to phase". This statement is not always true. Read the paper that appeared in ICASSP 99, "On the Phase Perception of Speech", by W. Kleijn.

For sinusoidal coders, see this paper: S. Ahmadi and A. Spanias, "A new phase model for sinusoidal transform coding of speech", IEEE Trans. on Speech and Audio Processing, Vol. 6, No. 5, Sept. 1998, pp. 495-501.

Any comments are welcome.

----------------------------------------------------------------
Khaled El-Maleh
Department of Electrical & Computer Engineering
McGill University
3480 University St.
Montreal, Quebec H3A 2A7, Canada
Telephone: (514) 398-5233 (O)
Fax: (514) 398-4470 |
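The closed-loop (analysis-by-synthesis) search mentioned in the post for CELP coders can be sketched as below. This is a deliberately simplified illustration and not any standardized coder: there is no perceptual weighting filter and no adaptive (pitch) codebook, and the first-order predictor, the random codebook, and all function names are assumptions. Each codevector is passed through the LP synthesis filter, and the entry whose optimally scaled output is closest to the target waveform is chosen.

```python
import numpy as np

def synthesize(excitation, lpc):
    """All-pole synthesis 1/A(z) with A(z) = 1 + sum_k lpc[k-1] z^-k, zero initial state."""
    out = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k, a_k in enumerate(lpc, start=1):
            if n - k >= 0:
                acc -= a_k * out[n - k]
        out[n] = acc
    return out

def search_codebook(target, codebook, lpc):
    """Return (index, gain) minimizing ||target - gain * synthesize(codevector)||^2."""
    best_index, best_gain, best_score = -1, 0.0, -np.inf
    for i, codevector in enumerate(codebook):
        y = synthesize(codevector, lpc)
        energy = np.dot(y, y)
        if energy == 0.0:
            continue
        corr = np.dot(target, y)
        # With the optimal gain corr / energy, the squared error is
        # ||target||^2 - corr^2 / energy, so maximizing this score minimizes the error.
        score = corr * corr / energy
        if score > best_score:
            best_index, best_gain, best_score = i, corr / energy, score
    return best_index, best_gain

rng = np.random.default_rng(1)
lpc = np.array([-0.9])                    # hypothetical predictor: A(z) = 1 - 0.9 z^-1
codebook = rng.standard_normal((32, 40))  # 32 random codevectors, 40 samples each

# Build a target that is exactly entry 7 scaled by 2, so the search result is known.
target = 2.0 * synthesize(codebook[7], lpc)
index, gain = search_codebook(target, codebook, lpc)
print(index, gain)  # index 7, gain close to 2.0
```

Because the error is measured on the time-domain waveform, a codevector with the right magnitude spectrum but the wrong phase scores poorly. This is the sense in which waveform matching models magnitude and phase together, and it also suggests why a small codebook at low bit rates leaves the phase poorly modeled.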